Rusty Bargain, a used car sales service, is developing an app to attract new customers. The app lets you quickly find out the market value of your car. You have access to historical data: technical specifications, trim versions, and prices. You need to build a model that determines the value.
Rusty Bargain is interested in:
- the quality of the prediction;
- the speed of the prediction;
- the time required for training.
Target = price
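The three criteria above can be measured with one small helper. This is a minimal sketch rather than the notebook's own evaluation code: `evaluate_model` and `MeanBaseline` are hypothetical names, and the toy arrays are made up to stand in for the real features and prices. It reports RMSE together with the wall-clock time spent on training and on prediction.

```python
import time
import numpy as np


class MeanBaseline:
    """Hypothetical stand-in model: always predicts the training mean."""

    def fit(self, X, y):
        self.mean_ = float(np.mean(y))
        return self

    def predict(self, X):
        return np.full(len(X), self.mean_)


def evaluate_model(model, X_train, y_train, X_test, y_test):
    """Return RMSE, training time, and prediction time for any fit/predict model."""
    start = time.perf_counter()
    model.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    predictions = model.predict(X_test)
    predict_seconds = time.perf_counter() - start

    rmse = float(np.sqrt(np.mean((np.asarray(y_test) - predictions) ** 2)))
    return {"rmse": rmse, "fit_seconds": fit_seconds, "predict_seconds": predict_seconds}


# Toy data (made up for illustration only)
X_train = np.array([[1.0], [2.0], [3.0]])
y_train = np.array([100.0, 200.0, 300.0])
X_test = np.array([[4.0], [5.0]])
y_test = np.array([200.0, 200.0])

report = evaluate_model(MeanBaseline(), X_train, y_train, X_test, y_test)
print(report)
```

Any estimator exposing `fit` and `predict` could be passed in place of the baseline, which keeps the comparison across models uniform.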
Environment Setup & Required Libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from catboost import CatBoostRegressor
from xgboost import XGBRegressor
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import time
import gc
Data preparation
df = pd.read_csv("/datasets/car_data.csv")
display(df)
| | DateCrawled | Price | VehicleType | RegistrationYear | Gearbox | Power | Model | Mileage | RegistrationMonth | FuelType | Brand | NotRepaired | DateCreated | NumberOfPictures | PostalCode | LastSeen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24/03/2016 11:52 | 480 | NaN | 1993 | manual | 0 | golf | 150000 | 0 | petrol | volkswagen | NaN | 24/03/2016 00:00 | 0 | 70435 | 07/04/2016 03:16 |
| 1 | 24/03/2016 10:58 | 18300 | coupe | 2011 | manual | 190 | NaN | 125000 | 5 | gasoline | audi | yes | 24/03/2016 00:00 | 0 | 66954 | 07/04/2016 01:46 |
| 2 | 14/03/2016 12:52 | 9800 | suv | 2004 | auto | 163 | grand | 125000 | 8 | gasoline | jeep | NaN | 14/03/2016 00:00 | 0 | 90480 | 05/04/2016 12:47 |
| 3 | 17/03/2016 16:54 | 1500 | small | 2001 | manual | 75 | golf | 150000 | 6 | petrol | volkswagen | no | 17/03/2016 00:00 | 0 | 91074 | 17/03/2016 17:40 |
| 4 | 31/03/2016 17:25 | 3600 | small | 2008 | manual | 69 | fabia | 90000 | 7 | gasoline | skoda | no | 31/03/2016 00:00 | 0 | 60437 | 06/04/2016 10:17 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 354364 | 21/03/2016 09:50 | 0 | NaN | 2005 | manual | 0 | colt | 150000 | 7 | petrol | mitsubishi | yes | 21/03/2016 00:00 | 0 | 2694 | 21/03/2016 10:42 |
| 354365 | 14/03/2016 17:48 | 2200 | NaN | 2005 | NaN | 0 | NaN | 20000 | 1 | NaN | sonstige_autos | NaN | 14/03/2016 00:00 | 0 | 39576 | 06/04/2016 00:46 |
| 354366 | 05/03/2016 19:56 | 1199 | convertible | 2000 | auto | 101 | fortwo | 125000 | 3 | petrol | smart | no | 05/03/2016 00:00 | 0 | 26135 | 11/03/2016 18:17 |
| 354367 | 19/03/2016 18:57 | 9200 | bus | 1996 | manual | 102 | transporter | 150000 | 3 | gasoline | volkswagen | no | 19/03/2016 00:00 | 0 | 87439 | 07/04/2016 07:15 |
| 354368 | 20/03/2016 19:41 | 3400 | wagon | 2002 | manual | 100 | golf | 150000 | 6 | gasoline | volkswagen | NaN | 20/03/2016 00:00 | 0 | 40764 | 24/03/2016 12:45 |
354369 rows × 16 columns
# Inspect dataset
df1 = df.copy()
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354369 entries, 0 to 354368
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   DateCrawled        354369 non-null  object
 1   Price              354369 non-null  int64
 2   VehicleType        316879 non-null  object
 3   RegistrationYear   354369 non-null  int64
 4   Gearbox            334536 non-null  object
 5   Power              354369 non-null  int64
 6   Model              334664 non-null  object
 7   Mileage            354369 non-null  int64
 8   RegistrationMonth  354369 non-null  int64
 9   FuelType           321474 non-null  object
 10  Brand              354369 non-null  object
 11  NotRepaired        283215 non-null  object
 12  DateCreated        354369 non-null  object
 13  NumberOfPictures   354369 non-null  int64
 14  PostalCode         354369 non-null  int64
 15  LastSeen           354369 non-null  object
dtypes: int64(7), object(9)
memory usage: 43.3+ MB
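The non-null counts above show that several columns have gaps. One way to rank them is by the share of missing values per column. The sketch below uses a made-up toy frame with a few of the same column names; the values are illustrative and not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up values) mirroring a few of the columns above
toy = pd.DataFrame({
    "Price": [480, 18300, 9800, 1500],
    "VehicleType": [np.nan, "coupe", "suv", "small"],
    "NotRepaired": [np.nan, "yes", np.nan, "no"],
})

# Share of missing values per column, largest first
missing_share = toy.isna().mean().sort_values(ascending=False)
print(missing_share)
```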
Standardize Columns
df.columns = df.columns.str.lower()
display(df.head())
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 24/03/2016 11:52 | 480 | NaN | 1993 | manual | 0 | golf | 150000 | 0 | petrol | volkswagen | NaN | 24/03/2016 00:00 | 0 | 70435 | 07/04/2016 03:16 |
| 1 | 24/03/2016 10:58 | 18300 | coupe | 2011 | manual | 190 | NaN | 125000 | 5 | gasoline | audi | yes | 24/03/2016 00:00 | 0 | 66954 | 07/04/2016 01:46 |
| 2 | 14/03/2016 12:52 | 9800 | suv | 2004 | auto | 163 | grand | 125000 | 8 | gasoline | jeep | NaN | 14/03/2016 00:00 | 0 | 90480 | 05/04/2016 12:47 |
| 3 | 17/03/2016 16:54 | 1500 | small | 2001 | manual | 75 | golf | 150000 | 6 | petrol | volkswagen | no | 17/03/2016 00:00 | 0 | 91074 | 17/03/2016 17:40 |
| 4 | 31/03/2016 17:25 | 3600 | small | 2008 | manual | 69 | fabia | 90000 | 7 | gasoline | skoda | no | 31/03/2016 00:00 | 0 | 60437 | 06/04/2016 10:17 |
# Manually set the model for row 251638
df.loc[251638, ['model']] = 'wrangler'
Details to Help with Data Cleaning
General
- 1769: Steam wagon (Nicolas-Joseph Cugnot, France): steam-powered, heavy, experimental, not practical
- 1800s: Steam carriages: small numbers in the UK and France, for private roads
- 1830s–1890s: Electric vehicles: short-range city vehicles, mostly experimental or low-volume
- 1885: The first gasoline car was built
- August 14, 1893: The first car registration was issued
History of Automatic Transmissions
- 1904: Sturtevant Automatic Automobile
- 1939/1940: Cadillac & Oldsmobile w/ Hydra-Matic by General Motors
- 1941: Buick (military; WWII civilian car production halted in 1942); Chrysler Fluid Drive / Vacamatic / Prestomatic
- 1948: Buick Roadmaster / Dynaflow (1949)
- 1950: Powerglide by Chevrolet
- 1961: K4A Mercedes-Benz
- most Cadillac, Oldsmobile, Buick, and Chrysler
- 1962 - : Automatics rapidly expanded
First Car by Brand (Earliest Registration Year)
- Rover: 1885
- Mercedes-Benz: 1886
- Peugeot: 1889
- Opel: 1899
- Renault: 1899
- Fiat: 1899
- Ford: 1903
- Škoda: 1905
- Lancia: 1906
- Daihatsu: 1907
- Suzuki: 1909
- Audi: 1910
- Alfa Romeo: 1910
- Chevrolet: 1911
- Mitsubishi: 1917
- Citroën: 1919
- Jaguar: 1922
- Chrysler: 1924
- Volvo: 1927
- BMW: 1928
- Mazda: 1931
- Porsche: 1931
- Nissan: 1933
- Toyota: 1936
- Volkswagen: 1937
- Jeep: 1941
- Kia: 1944
- Saab: 1947
- Honda: 1948
- Land Rover: 1948
- SEAT: 1950
- Subaru: 1954
- Trabant: 1957
- Mini: 1959
- Dacia: 1966
- Daewoo: 1967
- Hyundai: 1967
- Lada: 1970
- Smart: 1998
- Sonstige_autos: N/A (miscellaneous)
# Look at years before 1885 and after 2025
df[(df["registrationyear"] > 2025) | (df["registrationyear"] < 1885)]
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 622 | 16/03/2016 16:55 | 0 | NaN | 1111 | NaN | 0 | NaN | 5000 | 0 | NaN | opel | NaN | 16/03/2016 00:00 | 0 | 44628 | 20/03/2016 16:44 |
| 12946 | 29/03/2016 18:39 | 49 | NaN | 5000 | NaN | 0 | golf | 5000 | 12 | NaN | volkswagen | NaN | 29/03/2016 00:00 | 0 | 74523 | 06/04/2016 04:16 |
| 15147 | 14/03/2016 00:52 | 0 | NaN | 9999 | NaN | 0 | NaN | 10000 | 0 | NaN | sonstige_autos | NaN | 13/03/2016 00:00 | 0 | 32689 | 21/03/2016 23:46 |
| 15870 | 02/04/2016 11:55 | 1700 | NaN | 3200 | NaN | 0 | NaN | 5000 | 0 | NaN | sonstige_autos | NaN | 02/04/2016 00:00 | 0 | 33649 | 06/04/2016 09:46 |
| 16062 | 29/03/2016 23:42 | 190 | NaN | 1000 | NaN | 0 | mondeo | 5000 | 0 | NaN | ford | NaN | 29/03/2016 00:00 | 0 | 47166 | 06/04/2016 10:44 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 340548 | 02/04/2016 17:44 | 0 | NaN | 3500 | manual | 75 | NaN | 5000 | 3 | petrol | sonstige_autos | NaN | 02/04/2016 00:00 | 0 | 96465 | 04/04/2016 15:17 |
| 340759 | 04/04/2016 23:55 | 700 | NaN | 1600 | manual | 1600 | a3 | 150000 | 4 | petrol | audi | no | 04/04/2016 00:00 | 0 | 86343 | 05/04/2016 06:44 |
| 341791 | 28/03/2016 17:37 | 1 | NaN | 3000 | NaN | 0 | zafira | 5000 | 0 | NaN | opel | NaN | 28/03/2016 00:00 | 0 | 26624 | 02/04/2016 22:17 |
| 348830 | 22/03/2016 00:38 | 1 | NaN | 1000 | NaN | 1000 | NaN | 150000 | 0 | NaN | sonstige_autos | NaN | 21/03/2016 00:00 | 0 | 41472 | 05/04/2016 14:18 |
| 351682 | 12/03/2016 00:57 | 11500 | NaN | 1800 | NaN | 16 | other | 5000 | 6 | petrol | fiat | NaN | 11/03/2016 00:00 | 0 | 16515 | 05/04/2016 19:47 |
171 rows × 16 columns
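The filter above uses a fixed upper bound of 2025. As a possible alternative, and purely as an assumption not made in this notebook, the upper bound could be derived from the crawl timestamps, since a listing cannot describe a car registered after it was crawled. The sketch below uses a made-up toy frame; `latest_crawl_year` and `after_crawl` are hypothetical names.

```python
import pandas as pd

# Toy frame (made-up rows) with the same day-first timestamp format as the data
toy = pd.DataFrame({
    "datecrawled": ["24/03/2016 11:52", "05/03/2016 19:56", "07/04/2016 03:16"],
    "registrationyear": [1993, 2011, 5000],
})

# Parse the day-first timestamps, e.g. 24/03/2016 11:52
crawled = pd.to_datetime(toy["datecrawled"], format="%d/%m/%Y %H:%M")
latest_crawl_year = int(crawled.dt.year.max())

# Rows whose registration year is later than the last crawl year
after_crawl = toy[toy["registrationyear"] > latest_crawl_year]
print(latest_crawl_year)
print(after_crawl)
```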
# The first gasoline car dates to 1885, so earlier registration years are incorrect
car_dates = df[((df["registrationyear"] > 2025) | (df["registrationyear"] < 1885)) & df['model'].isna()]
car_dates
# Incorrect registration years are replaced with NaN (only where the model is also missing)
invalid_year = ((df["registrationyear"] > 2025) | (df["registrationyear"] < 1885)) & df['model'].isna()
df.loc[invalid_year, ["registrationyear"]] = np.nan
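The filter above combines an out-of-range test with a missing-model test. A compact way to express the range part is `Series.between`, sketched here on a made-up toy frame. One difference to keep in mind: negating `between` also marks rows whose year is already NaN, which is harmless here because those rows would only be set to NaN again.

```python
import numpy as np
import pandas as pd

# Toy frame (made-up rows)
toy = pd.DataFrame({
    "registrationyear": [1111, 2005, 9999, 1000],
    "model": [np.nan, "golf", np.nan, "mondeo"],
})

# between() is inclusive on both ends by default
out_of_range = ~toy["registrationyear"].between(1885, 2025)
no_model = toy["model"].isna()

# (too late OR too early) AND model missing
mask = out_of_range & no_model

# Cast to float first so the integer column can hold NaN
toy["registrationyear"] = toy["registrationyear"].astype("float64")
toy.loc[mask, "registrationyear"] = np.nan
print(toy)
```

Row 3 keeps its implausible year because its model is known; rows like that are handled later by the per-model checks.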
Brands with no registration years earlier than their earliest record
- Lada
- Daewoo
- Dacia
- Mini
- SEAT
- Land Rover
- Honda
- Saab
- Kia
- Nissan
- Porsche
- Mazda
- Jaguar
- Chrysler
- Volvo
- Rover
- Mercedes-Benz
- Peugeot
- Opel
- Renault
- Fiat
- Ford
- Škoda
- Lancia
- Daihatsu
- Suzuki
- Audi
- Alfa Romeo
- Chevrolet
# Look at smart cars registered before 1998
df[(df['brand'] == 'smart') & (df['registrationyear'] < 1998)]
smartnan = (df['brand'] == 'smart') & (df['registrationyear'] < 1998) & (df['model'].isna())
df.loc[smartnan,['registrationyear']] = np.nan
# Hyundai before 1967 implausible
hyundai = (df['brand'] == 'hyundai') & (df['registrationyear'] < 1967) & (df['model'].isna())
df.loc[hyundai, ['registrationyear']] = np.nan
remaining = ['smart', 'hyundai', 'mitsubishi', 'citroen', 'bmw', 'toyota', 'volkswagen', 'jeep', 'subaru', 'trabant']
earliest_years = {'smart': 1998, 'hyundai': 1967,'mitsubishi': 1917, 'citroen': 1919, 'bmw': 1928, 'toyota': 1936,
'volkswagen': 1937, 'jeep': 1941, 'subaru': 1954, 'trabant': 1957}
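for brands in remaining:
    # Blank out years earlier than the brand's first car when the model is unknown
    df.loc[(df['brand'] == brands) & (df['registrationyear'] < earliest_years[brands]) &
           df['model'].isna(), ['registrationyear']] = np.nan
    # Show the rows that are still too early (their model is known)
    display(df[(df['brand'] == brands) & (df['registrationyear'] < earliest_years[brands])])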
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 31212 | 12/03/2016 16:45 | 700 | small | 1997.0 | NaN | 0 | forfour | 5000 | 3 | petrol | smart | NaN | 12/03/2016 00:00 | 0 | 88416 | 07/04/2016 06:17 |
| 161667 | 04/04/2016 20:56 | 1650 | small | 1992.0 | auto | 55 | fortwo | 100000 | 7 | petrol | smart | no | 04/04/2016 00:00 | 0 | 28327 | 06/04/2016 23:44 |
| 319739 | 05/04/2016 20:36 | 1650 | small | 1992.0 | NaN | 0 | fortwo | 100000 | 6 | NaN | smart | yes | 05/04/2016 00:00 | 0 | 28327 | 05/04/2016 20:36 |
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 244840 | 09/03/2016 17:50 | 0 | NaN | 1910.0 | NaN | 0 | other | 5000 | 0 | NaN | hyundai | NaN | 09/03/2016 00:00 | 0 | 59510 | 07/04/2016 10:44 |
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 154559 | 03/04/2016 12:40 | 0 | small | 1910.0 | manual | 0 | colt | 150000 | 0 | petrol | mitsubishi | NaN | 03/04/2016 00:00 | 0 | 46397 | 07/04/2016 14:57 |
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 125577 | 15/03/2016 18:38 | 7750 | NaN | 1001.0 | NaN | 0 | other | 5000 | 0 | NaN | citroen | NaN | 15/03/2016 00:00 | 0 | 66706 | 06/04/2016 18:47 |
| 270911 | 23/03/2016 11:48 | 0 | other | 1910.0 | manual | 0 | other | 5000 | 0 | petrol | citroen | no | 23/03/2016 00:00 | 0 | 98630 | 23/03/2016 11:48 |
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 58883 | 15/03/2016 21:57 | 1 | NaN | 1910.0 | NaN | 0 | 3er | 150000 | 0 | NaN | bmw | NaN | 15/03/2016 00:00 | 0 | 74074 | 07/04/2016 07:17 |
| 119442 | 18/03/2016 10:37 | 1 | NaN | 1000.0 | NaN | 1000 | 3er | 5000 | 0 | NaN | bmw | NaN | 18/03/2016 00:00 | 0 | 94086 | 05/04/2016 22:16 |
| 203230 | 01/04/2016 15:37 | 400 | NaN | 1910.0 | manual | 170 | 3er | 5000 | 0 | NaN | bmw | NaN | 01/04/2016 00:00 | 0 | 66333 | 03/04/2016 11:48 |
| 213499 | 08/03/2016 12:06 | 380 | NaN | 1000.0 | NaN | 0 | 6er | 5000 | 0 | NaN | bmw | NaN | 08/03/2016 00:00 | 0 | 35102 | 06/04/2016 00:16 |
| 287304 | 09/03/2016 15:54 | 500 | NaN | 1602.0 | manual | 0 | other | 5000 | 0 | NaN | bmw | yes | 09/03/2016 00:00 | 0 | 30900 | 10/03/2016 12:17 |
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 23750 | 16/03/2016 19:58 | 3900 | wagon | 1910.0 | manual | 90 | passat | 150000 | 0 | petrol | volkswagen | NaN | 16/03/2016 00:00 | 0 | 88662 | 07/04/2016 05:45 |
| 35943 | 19/03/2016 10:57 | 200 | other | 1910.0 | NaN | 0 | caddy | 150000 | 0 | gasoline | volkswagen | NaN | 19/03/2016 00:00 | 0 | 35096 | 20/03/2016 18:10 |
| 40133 | 23/03/2016 18:00 | 0 | NaN | 1910.0 | NaN | 0 | other | 5000 | 0 | NaN | volkswagen | NaN | 23/03/2016 00:00 | 0 | 85045 | 23/03/2016 18:41 |
| 53577 | 20/03/2016 11:44 | 330 | NaN | 1000.0 | NaN | 0 | polo | 5000 | 0 | NaN | volkswagen | NaN | 20/03/2016 00:00 | 0 | 45259 | 04/04/2016 08:17 |
| 56241 | 30/03/2016 18:54 | 950 | NaN | 1400.0 | manual | 1400 | golf | 125000 | 4 | petrol | volkswagen | NaN | 30/03/2016 00:00 | 0 | 50389 | 03/04/2016 09:45 |
| 62803 | 07/03/2016 22:58 | 3400 | small | 1910.0 | manual | 90 | beetle | 90000 | 4 | NaN | volkswagen | no | 07/03/2016 00:00 | 0 | 34308 | 12/03/2016 08:16 |
| 71459 | 27/03/2016 23:46 | 500 | NaN | 1000.0 | NaN | 0 | golf | 5000 | 0 | NaN | volkswagen | NaN | 27/03/2016 00:00 | 0 | 91628 | 29/03/2016 13:46 |
| 74814 | 21/03/2016 12:52 | 400 | NaN | 1910.0 | NaN | 60 | golf | 150000 | 0 | petrol | volkswagen | NaN | 21/03/2016 00:00 | 0 | 29462 | 25/03/2016 09:17 |
| 143621 | 17/03/2016 23:40 | 550 | NaN | 1000.0 | NaN | 1000 | golf | 5000 | 6 | petrol | volkswagen | NaN | 17/03/2016 00:00 | 0 | 91732 | 26/03/2016 05:18 |
| 144388 | 09/03/2016 20:52 | 50 | NaN | 1910.0 | NaN | 0 | kaefer | 5000 | 0 | NaN | volkswagen | NaN | 09/03/2016 00:00 | 0 | 50374 | 05/04/2016 18:46 |
| 147663 | 03/04/2016 19:37 | 0 | NaN | 1910.0 | NaN | 0 | polo | 5000 | 0 | NaN | volkswagen | NaN | 03/04/2016 00:00 | 0 | 2826 | 05/04/2016 20:15 |
| 151280 | 05/04/2016 00:39 | 300 | NaN | 1910.0 | manual | 0 | golf | 150000 | 0 | petrol | volkswagen | NaN | 04/04/2016 00:00 | 0 | 89269 | 05/04/2016 05:42 |
| 164397 | 29/03/2016 17:49 | 0 | NaN | 1000.0 | NaN | 0 | transporter | 5000 | 1 | NaN | volkswagen | NaN | 29/03/2016 00:00 | 0 | 29351 | 06/04/2016 03:45 |
| 174893 | 05/03/2016 19:48 | 0 | NaN | 1000.0 | NaN | 1000 | golf | 5000 | 4 | petrol | volkswagen | NaN | 05/03/2016 00:00 | 0 | 35716 | 05/03/2016 22:27 |
| 183727 | 03/04/2016 12:48 | 0 | bus | 1910.0 | NaN | 0 | transporter | 5000 | 0 | NaN | volkswagen | NaN | 03/04/2016 00:00 | 0 | 84478 | 03/04/2016 12:48 |
| 189722 | 29/03/2016 16:56 | 1500 | NaN | 1000.0 | manual | 0 | kaefer | 5000 | 0 | petrol | volkswagen | NaN | 29/03/2016 00:00 | 0 | 48324 | 31/03/2016 10:15 |
| 203985 | 07/03/2016 14:53 | 222 | NaN | 1910.0 | manual | 0 | golf | 5000 | 0 | petrol | volkswagen | NaN | 07/03/2016 00:00 | 0 | 26802 | 12/03/2016 04:15 |
| 218241 | 16/03/2016 12:46 | 7999 | NaN | 1800.0 | NaN | 290 | golf | 5000 | 6 | NaN | volkswagen | NaN | 16/03/2016 00:00 | 0 | 15827 | 29/03/2016 20:47 |
| 256532 | 05/03/2016 17:44 | 12500 | NaN | 1000.0 | NaN | 200 | golf | 5000 | 0 | NaN | volkswagen | NaN | 28/02/2016 00:00 | 0 | 75378 | 07/04/2016 12:17 |
| 276318 | 31/03/2016 14:58 | 300 | NaN | 1910.0 | NaN | 0 | polo | 5000 | 0 | NaN | volkswagen | NaN | 31/03/2016 00:00 | 0 | 53902 | 06/04/2016 08:16 |
| 286928 | 18/03/2016 16:51 | 1 | NaN | 1000.0 | NaN | 174 | touareg | 5000 | 3 | gasoline | volkswagen | NaN | 18/03/2016 00:00 | 0 | 97616 | 05/04/2016 22:44 |
| 318111 | 25/03/2016 13:42 | 1 | NaN | 1910.0 | NaN | 0 | golf | 125000 | 0 | NaN | volkswagen | NaN | 25/03/2016 00:00 | 0 | 54295 | 06/04/2016 15:44 |
| 318501 | 02/04/2016 13:57 | 0 | NaN | 1910.0 | NaN | 0 | caddy | 5000 | 0 | NaN | volkswagen | NaN | 02/04/2016 00:00 | 0 | 16949 | 06/04/2016 12:16 |
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18224 | 09/03/2016 17:49 | 7999 | NaN | 1500.0 | manual | 224 | impreza | 5000 | 3 | NaN | subaru | NaN | 09/03/2016 00:00 | 0 | 53577 | 15/03/2016 05:15 |
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 199563 | 09/03/2016 20:37 | 60 | wagon | 1956.0 | NaN | 0 | other | 150000 | 0 | NaN | trabant | NaN | 09/03/2016 00:00 | 0 | 16775 | 05/04/2016 16:45 |
| 294028 | 28/03/2016 23:45 | 0 | NaN | 1111.0 | NaN | 0 | 601 | 5000 | 0 | NaN | trabant | NaN | 28/03/2016 00:00 | 0 | 6712 | 30/03/2016 16:45 |
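The per-brand loop above can also be written without a loop by mapping each row's brand to its earliest plausible year. This is a sketch on a made-up toy frame, using a shortened copy of the `earliest_years` dictionary; `brand_floor` is a hypothetical name.

```python
import numpy as np
import pandas as pd

# Shortened copy of the brand-to-earliest-year mapping
earliest_years = {"smart": 1998, "hyundai": 1967, "bmw": 1928}

# Toy frame (made-up rows)
toy = pd.DataFrame({
    "brand": ["smart", "smart", "hyundai", "bmw", "opel"],
    "model": [np.nan, "fortwo", np.nan, "3er", np.nan],
    "registrationyear": [1990.0, 1992.0, 1910.0, 1910.0, 1800.0],
})

# Per-row earliest plausible year; brands missing from the dict map to NaN
brand_floor = toy["brand"].map(earliest_years)

# Comparisons against NaN are False, so unmapped brands are never flagged
too_early = toy["registrationyear"] < brand_floor

# As in the loop: blank the year only when the model is also unknown
toy.loc[too_early & toy["model"].isna(), "registrationyear"] = np.nan
print(toy)
```

Rows with a known model (the `fortwo` and `3er` rows here) keep their year, matching the tables displayed by the loop.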
# Kaefer is the German name for the Beetle; they are the same car
beetle = df['model'] == 'kaefer'
df.loc[beetle, ['model']] = 'beetle'
del smartnan
del hyundai
del beetle
gc.collect()
4
df['registrationyear'] = pd.to_numeric(df['registrationyear'], errors = 'coerce')
# Define models and earliest years
model_cols = ['fortwo', 'forfour', 'colt', '3er', '6er', 'passat', 'caddy',
'polo', 'golf', 'beetle', 'transporter', 'touareg',
'impreza', '601']
earliest_registration = {
'fortwo': 1998,
'forfour': 2004,
'colt': 1962,
'3er': 1975,
'6er': 1976,
'passat': 1973,
'caddy': 1980,
'polo': 1975,
'golf': 1974,
'beetle': 1938,
'transporter': 1950,
'touareg': 2002,
'impreza': 1992,
'601': 1964
}
latest_registration = {
'fortwo': 2025,
'forfour': 2025,
'colt': 2013,
'3er': 2025,
'6er': 2025,
'passat': 2025,
'caddy': 2025,
'polo': 2025,
'golf': 2025,
'beetle': 2019,
'transporter': 2025,
'touareg': 2025,
'impreza': 2025,
'601': 1991
}
# Object dtype so the column can hold string flags alongside NaN
df['registration_correction'] = pd.Series(np.nan, index=df.index, dtype='object')
for model in model_cols:
    # Too early
    too_early = (df['model'] == model) & (df['registrationyear'] < earliest_registration[model])
    df.loc[too_early, ['registration_correction']] = "Y: too early"
    # Too late
    too_late = (df['model'] == model) & (df['registrationyear'] > latest_registration[model])
    df.loc[too_late, ['registration_correction']] = "Y: too late"
    # Acceptable registration
    acceptable = (df['model'] == model) & ((df['registrationyear'] >= earliest_registration[model]) & (df['registrationyear'] <= latest_registration[model]))
    df.loc[acceptable, ['registration_correction']] = 'N'
# Missing registration year (does not depend on the model, so it is set once)
missing = df['registrationyear'].isna()
df.loc[missing, ['registration_correction']] = "Y: missing"
display(df['registration_correction'].isna().sum())
registration_years = {
'corsa': {'earliest': 1982, 'latest': 2025},
'astra': {'earliest': 1991, 'latest': 2025},
'passat': {'earliest': 1973, 'latest': 2025},
'a4': {'earliest': 1994, 'latest': 2025},
'c_klasse': {'earliest': 1993, 'latest': 2025},
'5er': {'earliest': 1972, 'latest': 2025},
'e_klasse': {'earliest': 1993, 'latest': 2025},
'a3': {'earliest': 1996, 'latest': 2025},
'focus': {'earliest': 1998, 'latest': 2025},
'fiesta': {'earliest': 1976, 'latest': 2025},
'a6': {'earliest': 1994, 'latest': 2025},
'twingo': {'earliest': 1993, 'latest': 2025},
'transporter': {'earliest': 1950, 'latest': 2025},
'2_reihe': {'earliest': 1982, 'latest': 2025},
'vectra': {'earliest': 1988, 'latest': 2008},
'a_klasse': {'earliest': 1997, 'latest': 2025},
'mondeo': {'earliest': 1993, 'latest': 2025},
'clio': {'earliest': 1991, 'latest': 2025},
'1er': {'earliest': 2004, 'latest': 2025},
'3_reihe': {'earliest': 1982, 'latest': 2025},
'touran': {'earliest': 2003, 'latest': 2025},
'punto': {'earliest': 1993, 'latest': 2025},
'zafira': {'earliest': 1999, 'latest': 2025},
'megane': {'earliest': 1995, 'latest': 2025},
'ibiza': {'earliest': 1984, 'latest': 2025},
'ka': {'earliest': 1996, 'latest': 2025},
'lupo': {'earliest': 1998, 'latest': 2005},
'octavia': {'earliest': 1996, 'latest': 2025},
'fabia': {'earliest': 1999, 'latest': 2025},
'cooper': {'earliest': 2001, 'latest': 2025},
'clk': {'earliest': 1997, 'latest': 2010},
'micra': {'earliest': 1982, 'latest': 2025},
'80': {'earliest': 1972, 'latest': 1996},
'caddy': {'earliest': 1980, 'latest': 2025},
'x_reihe': {'earliest': 2000, 'latest': 2025},
'sharan': {'earliest': 1995, 'latest': 2025},
'scenic': {'earliest': 1996, 'latest': 2025},
'omega': {'earliest': 1986, 'latest': 2003},
'laguna': {'earliest': 1994, 'latest': 2025},
'civic': {'earliest': 1972, 'latest': 2025},
'1_reihe': {'earliest': 1970, 'latest': 2025},
'leon': {'earliest': 1999, 'latest': 2025},
'6_reihe': {'earliest': 2003, 'latest': 2025},
'i_reihe': {'earliest': 2004, 'latest': 2025},
'slk': {'earliest': 1996, 'latest': 2025},
'galaxy': {'earliest': 1959, 'latest': 2025},
'tt': {'earliest': 1998, 'latest': 2025},
'meriva': {'earliest': 2003, 'latest': 2025},
'yaris': {'earliest': 1999, 'latest': 2025},
'7er': {'earliest': 1977, 'latest': 2025},
'mx_reihe': {'earliest': 1989, 'latest': 2025},
'kangoo': {'earliest': 1997, 'latest': 2025},
'm_klasse': {'earliest': 1997, 'latest': 2025},
'500': {'earliest': 1957, 'latest': 2025},
'escort': {'earliest': 1968, 'latest': 2000},
'arosa': {'earliest': 1997, 'latest': 2005},
'one': {'earliest': 2001, 'latest': 2025},
's_klasse': {'earliest': 1972, 'latest': 2025},
'vito': {'earliest': 1996, 'latest': 2025},
'b_klasse': {'earliest': 2005, 'latest': 2025},
'bora': {'earliest': 1998, 'latest': 2005},
'berlingo': {'earliest': 1996, 'latest': 2025},
'tigra': {'earliest': 1994, 'latest': 2008},
'v40': {'earliest': 1995, 'latest': 2025},
'sprinter': {'earliest': 1995, 'latest': 2025},
'transit': {'earliest': 1965, 'latest': 2025},
'fox': {'earliest': 2003, 'latest': 2025},
'z_reihe': {'earliest': 1998, 'latest': 2025},
'swift': {'earliest': 1983, 'latest': 2025},
'c_max': {'earliest': 2003, 'latest': 2025},
'corolla': {'earliest': 1966, 'latest': 2025},
'panda': {'earliest': 1980, 'latest': 2025},
'seicento': {'earliest': 1998, 'latest': 2007},
'tiguan': {'earliest': 2007, 'latest': 2025},
'insignia': {'earliest': 2008, 'latest': 2025},
'4_reihe': {'earliest': 1892, 'latest': 2025},
'v70': {'earliest': 1997, 'latest': 2025},
'156': {'earliest': 1997, 'latest': 2005},
'primera': {'earliest': 1990, 'latest': 2007},
'espace': {'earliest': 1984, 'latest': 2025},
'scirocco': {'earliest': 1974, 'latest': 2017},
'stilo': {'earliest': 2001, 'latest': 2008},
'a1': {'earliest': 2010, 'latest': 2025},
'almera': {'earliest': 1995, 'latest': 2006},
'147': {'earliest': 2000, 'latest': 2010},
'avensis': {'earliest': 1997, 'latest': 2025},
'grand': {'earliest': 1924, 'latest': 2025},
'a5': {'earliest': 2007, 'latest': 2025},
'qashqai': {'earliest': 2006, 'latest': 2025},
'a8': {'earliest': 1994, 'latest': 2025},
'eos': {'earliest': 2006, 'latest': 2025},
'c3': {'earliest': 2002, 'latest': 2025}
}
registration_cols = list(registration_years.keys())
for registration in registration_cols:
    earliest = registration_years[registration]['earliest']
    latest = registration_years[registration]['latest']
    # Too early
    early_reg = (df['model'] == registration) & (df['registrationyear'] < earliest)
    df.loc[early_reg, ['registration_correction']] = "Y: too early"
    # Too late
    late_reg = (df['model'] == registration) & (df['registrationyear'] > latest)
    df.loc[late_reg, ['registration_correction']] = "Y: too late"
    # Acceptable range
    acc_reg = (df['model'] == registration) & (df['registrationyear'] >= earliest) & (df['registrationyear'] <= latest)
    df.loc[acc_reg, ['registration_correction']] = 'N'
df['registration_correction'].isna().sum()
267415
73063
df['registration_correction'].value_counts(dropna = False)
N               278017
NaN              73063
Y: too early      1609
Y: too late       1583
Y: missing          97
Name: registration_correction, dtype: int64
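The flagging logic is repeated for each batch of models. One way to consolidate it is a single function that takes any `{model: {'earliest': ..., 'latest': ...}}` dictionary. This is a sketch under that assumption, not the notebook's own code: `flag_registration` is a hypothetical helper, and the toy frame is made up.

```python
import numpy as np
import pandas as pd


def flag_registration(frame, ranges):
    """Return a flag Series; rows whose model is not in `ranges` stay NaN."""
    bounds = pd.DataFrame.from_dict(ranges, orient="index")
    # Look up each row's bounds by model; unknown or missing models give NaN
    earliest = frame["model"].map(bounds["earliest"])
    latest = frame["model"].map(bounds["latest"])
    year = frame["registrationyear"]

    flags = pd.Series(np.nan, index=frame.index, dtype="object")
    flags.loc[year.isna()] = "Y: missing"
    flags.loc[year < earliest] = "Y: too early"
    flags.loc[year > latest] = "Y: too late"
    flags.loc[(year >= earliest) & (year <= latest)] = "N"
    return flags


# Two entries copied from the dictionaries above
ranges = {
    "golf": {"earliest": 1974, "latest": 2025},
    "601": {"earliest": 1964, "latest": 1991},
}

# Toy frame (made-up rows)
toy = pd.DataFrame({
    "model": ["golf", "golf", "601", "other", np.nan],
    "registrationyear": [1910.0, 2005.0, 1995.0, 2000.0, np.nan],
})

flags = flag_registration(toy, ranges)
print(flags)
```

Because comparisons against NaN evaluate to False, models absent from `ranges` are left unflagged, which mirrors how the loops leave the remaining models as NaN until their ranges are added.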
remainder_models = df[(df['registration_correction'].isna()) & (df['model'].notna())]
remainder_models['model'].unique()
array(['other', 'navara', 'c4', 'kadett', 'signum', 'jetta', 'forester',
'xc_reihe', 'combo', 'jazz', '100', 'sportage', 'sorento',
'mustang', 'getz', 'r19', 'cordoba', 'up', 'ceed', '5_reihe',
'yeti', 'mii', 'rx_reihe', 'modus', 'matiz', 'c1', 'rio', 'logan',
'spider', 'cuore', 's_max', 'a2', 'viano', 'roomster', 'sl',
'santa', 'ptcruiser', 'exeo', '159', 'juke', 'carisma', 'accord',
'lanos', 'phaeton', 'verso', 'rav', 'picanto', 'boxster', 'kalos',
'superb', 'alhambra', 'roadster', 'ypsilon', 'cayenne', 'galant',
'justy', '90', 'sirion', 'crossfire', 'agila', 'duster',
'cr_reihe', 'v50', 'c_reihe', 'v_klasse', 'c5', 'aygo', 'cc',
'carnival', 'fusion', '911', 'm_reihe', 'cl', '300c', 'spark',
'kuga', 'x_type', 'ducato', 's_type', 'x_trail', 'toledo', 'altea',
'voyager', 'calibra', 'bravo', 'antara', 'tucson', 'citigo',
'jimny', 'wrangler', 'lybra', 'q7', 'lancer', 'captiva', 'c2',
'discovery', 'freelander', 'sandero', 'note', '900', 'cherokee',
'clubman', 'samara', 'defender', 'cx_reihe', 'legacy', 'pajero',
'auris', 'niva', 's60', 'nubira', 'vivaro', 'g_klasse', 'lodgy',
'850', 'range_rover', 'q3', 'serie_2', 'glk', 'charade', 'croma',
'outlander', 'doblo', 'musa', 'move', '9000', 'v60', '145', 'aveo',
'200', 'b_max', 'range_rover_sport', 'terios', 'rangerover', 'q5',
'range_rover_evoque', 'materia', 'delta', 'gl', 'kalina', 'amarok',
'elefantino', 'i3', 'kappa', 'serie_3', 'serie_1'], dtype=object)
reg_cols = ['navara', 'c4', 'kadett', 'signum', 'jetta', 'forester',
'xc_reihe', 'combo', 'jazz', '100', 'sportage', 'sorento',
'mustang', 'getz', 'r19', 'cordoba', 'up', 'ceed', '5_reihe',
'yeti', 'mii', 'rx_reihe', 'modus', 'matiz', 'c1', 'rio', 'logan',
'spider', 'cuore', 's_max', 'a2', 'viano', 'roomster', 'sl',
'santa', 'ptcruiser', 'exeo', '159', 'juke', 'carisma', 'accord',
'lanos', 'phaeton', 'verso', 'rav', 'picanto', 'boxster', 'kalos',
'superb', 'alhambra']
reg_years = {
'navara': {'earliest': 1997, 'latest': 2025},
'c4': {'earliest': 2004, 'latest': 2025},
'kadett': {'earliest': 1937, 'latest': 1991},
'signum': {'earliest': 2003, 'latest': 2008},
'jetta': {'earliest': 1979, 'latest': 2018},
'forester': {'earliest': 1997, 'latest': 2025},
'xc_reihe': {'earliest': 2001, 'latest': 2025},
'combo': {'earliest': 1993, 'latest': 2025},
'jazz': {'earliest': 2001, 'latest': 2025},
'100': {'earliest': 1968, 'latest': 1994},
'sportage': {'earliest': 1993, 'latest': 2025},
'sorento': {'earliest': 2002, 'latest': 2025},
'mustang': {'earliest': 1964, 'latest': 2025},
'getz': {'earliest': 2002, 'latest': 2011},
'r19': {'earliest': 1988, 'latest': 1996},
'cordoba': {'earliest': 1993, 'latest': 2009},
'up': {'earliest': 2011, 'latest': 2025},
'ceed': {'earliest': 2006, 'latest': 2025},
'5_reihe': {'earliest': 1972, 'latest': 2025},
'yeti': {'earliest': 2009, 'latest': 2017},
'mii': {'earliest': 2011, 'latest': 2025},
'rx_reihe': {'earliest': 1978, 'latest': 2012},
'modus': {'earliest': 2004, 'latest': 2012},
'matiz': {'earliest': 1998, 'latest': 2018},
'c1': {'earliest': 2005, 'latest': 2025},
'rio': {'earliest': 2000, 'latest': 2025},
'logan': {'earliest': 2004, 'latest': 2025},
'spider': {'earliest': 1996, 'latest': 2006},
'cuore': {'earliest': 1977, 'latest': 2009},
's_max': {'earliest': 2006, 'latest': 2015},
'a2': {'earliest': 1999, 'latest': 2005},
'viano': {'earliest': 2003, 'latest': 2014},
'roomster': {'earliest': 2006, 'latest': 2015},
'sl': {'earliest': 1952, 'latest': 2011},
'santa': {'earliest': 1999, 'latest': 2013},
'ptcruiser':{'earliest': 2000, 'latest': 2010},
'exeo': {'earliest': 2008, 'latest': 2013},
'159': {'earliest': 2005, 'latest': 2011},
'juke': {'earliest': 2010, 'latest': 2025},
'carisma': {'earliest': 1995, 'latest': 2006},
'accord': {'earliest': 1976, 'latest': 2025},
'lanos': {'earliest': 1997, 'latest': 2009},
'phaeton': {'earliest': 2002, 'latest': 2016},
'verso': {'earliest': 2001, 'latest': 2018},
'rav': {'earliest': 1994, 'latest': 2018},
'picanto': {'earliest': 2003, 'latest': 2025},
'boxster': {'earliest': 1996, 'latest': 2025},
'kalos': {'earliest': 2002, 'latest': 2011},
'superb': {'earliest': 2001, 'latest': 2025},
'alhambra': {'earliest': 1996, 'latest': 2010},
}
for reg in reg_cols:
    earliest = reg_years[reg]['earliest']
    latest = reg_years[reg]['latest']
    # Early
    early = (df['model'] == reg) & (df['registrationyear'] < earliest)
    df.loc[early, ['registration_correction']] = "Y: too early"
    # Late
    late = (df['model'] == reg) & (df['registrationyear'] > latest)
    df.loc[late, ['registration_correction']] = "Y: too late"
    # Acceptable range
    ar = (df['model'] == reg) & (df['registrationyear'] >= earliest) & (df['registrationyear'] <= latest)
    df.loc[ar, ['registration_correction']] = "N"
df['registration_correction'].isna().sum()
58277
remainder_models = df[(df['registration_correction'].isna()) & (df['model'].notna())]
remainder_models['model'].unique()
array(['other', 'roadster', 'ypsilon', 'cayenne', 'galant', 'justy', '90',
'sirion', 'crossfire', 'agila', 'duster', 'cr_reihe', 'v50',
'c_reihe', 'v_klasse', 'c5', 'aygo', 'cc', 'carnival', 'fusion',
'911', 'm_reihe', 'cl', '300c', 'spark', 'kuga', 'x_type',
'ducato', 's_type', 'x_trail', 'toledo', 'altea', 'voyager',
'calibra', 'bravo', 'antara', 'tucson', 'citigo', 'jimny',
'wrangler', 'lybra', 'q7', 'lancer', 'captiva', 'c2', 'discovery',
'freelander', 'sandero', 'note', '900', 'cherokee', 'clubman',
'samara', 'defender', 'cx_reihe', 'legacy', 'pajero', 'auris',
'niva', 's60', 'nubira', 'vivaro', 'g_klasse', 'lodgy', '850',
'range_rover', 'q3', 'serie_2', 'glk', 'charade', 'croma',
'outlander', 'doblo', 'musa', 'move', '9000', 'v60', '145', 'aveo',
'200', 'b_max', 'range_rover_sport', 'terios', 'rangerover', 'q5',
'range_rover_evoque', 'materia', 'delta', 'gl', 'kalina', 'amarok',
'elefantino', 'i3', 'kappa', 'serie_3', 'serie_1'], dtype=object)
r_cols = [
'roadster', 'ypsilon', 'cayenne', 'galant',
'justy', '90', 'sirion', 'crossfire', 'agila', 'duster',
'cr_reihe', 'v50', 'c_reihe', 'v_klasse', 'c5', 'aygo', 'cc',
'carnival', 'fusion', '911', 'm_reihe', 'cl', '300c', 'spark', 'kuga', 'x_type',
'ducato', 's_type', 'x_trail', 'toledo', 'altea', 'voyager',
'calibra', 'bravo', 'antara', 'tucson', 'citigo', 'jimny',
'wrangler', 'lybra', 'q7', 'lancer', 'captiva', 'c2', 'discovery',
'freelander', 'sandero', 'note', '900', 'cherokee', 'clubman',
'samara', 'defender', 'cx_reihe', 'legacy', 'pajero', 'auris',
'niva', 's60', 'nubira', 'vivaro', 'g_klasse', 'lodgy', '850',
'range_rover', 'q3', 'serie_2', 'glk', 'charade', 'croma',
'outlander', 'doblo', 'musa', 'move', '9000', 'v60', '145', 'aveo',
'200', 'b_max', 'range_rover_sport', 'terios', 'rangerover', 'q5',
'range_rover_evoque', 'materia', 'delta', 'gl', 'kalina', 'amarok',
'elefantino', 'i3', 'kappa', 'serie_3', 'serie_1'
]
r_years = {
'roadster': {'earliest': 1998, 'latest': 2025},
'ypsilon': {'earliest': 1995, 'latest': 2025},
'cayenne': {'earliest': 2002, 'latest': 2025},
'galant': {'earliest': 1969, 'latest': 2012},
'justy': {'earliest': 1984, 'latest': 2010},
'90': {'earliest': 1984, 'latest': 1987},
'sirion': {'earliest': 1995, 'latest': 2025},
'crossfire': {'earliest': 2003, 'latest': 2008},
'agila': {'earliest': 2000, 'latest': 2014},
'duster': {'earliest': 2010, 'latest': 2025},
'cr_reihe': {'earliest': 1995, 'latest': 2025},
'v50': {'earliest': 2004, 'latest': 2012},
'c_reihe': {'earliest': 1993, 'latest': 2025},
'v_klasse': {'earliest': 1996, 'latest': 2025},
'c5': {'earliest': 2001, 'latest': 2017},
'aygo': {'earliest': 2005, 'latest': 2025},
'cc': {'earliest': 2008, 'latest': 2017},
'carnival': {'earliest': 1998, 'latest': 2025},
'fusion': {'earliest': 2002, 'latest': 2020},
'911': {'earliest': 1963, 'latest': 2025},
'm_reihe': {'earliest': 1976, 'latest': 2025},
'cl': {'earliest': 1996, 'latest': 2014},
'300c': {'earliest': 2005, 'latest': 2020},
'spark': {'earliest': 1998, 'latest': 2025},
'kuga': {'earliest': 2008, 'latest': 2025},
'x_type': {'earliest': 2001, 'latest': 2009},
'ducato': {'earliest': 1981, 'latest': 2025},
's_type': {'earliest': 1998, 'latest': 2008},
'x_trail': {'earliest': 2000, 'latest': 2025},
'toledo': {'earliest': 1991, 'latest': 2013},
'altea': {'earliest': 2004, 'latest': 2015},
'voyager': {'earliest': 1984, 'latest': 2025},
'calibra': {'earliest': 1989, 'latest': 1997},
'bravo': {'earliest': 1995, 'latest': 2006},
'antara': {'earliest': 2006, 'latest': 2025},
'tucson': {'earliest': 2004, 'latest': 2025},
'citigo': {'earliest': 2011, 'latest': 2025},
'jimny': {'earliest': 1983, 'latest': 2025},
'wrangler': {'earliest': 1986, 'latest': 2025},
'lybra': {'earliest': 1998, 'latest': 2005},
'q7': {'earliest': 2005, 'latest': 2025},
'lancer': {'earliest': 1973, 'latest': 2017},
'captiva': {'earliest': 2006, 'latest': 2025},
'c2': {'earliest': 2003, 'latest': 2009},
'discovery': {'earliest': 1989, 'latest': 2025},
'freelander': {'earliest': 1997, 'latest': 2014},
'sandero': {'earliest': 2007, 'latest': 2025},
'note': {'earliest': 2004, 'latest': 2025},
'900': {'earliest': 1978, 'latest': 1993},
'cherokee': {'earliest': 1984, 'latest': 2025},
'clubman': {'earliest': 2007, 'latest': 2025},
'samara': {'earliest': 1984, 'latest': 2001},
'defender': {'earliest': 1983, 'latest': 2016},
'cx_reihe': {'earliest': 2006, 'latest': 2011},
'legacy': {'earliest': 1989, 'latest': 2025},
'pajero': {'earliest': 1982, 'latest': 2021},
'auris': {'earliest': 2006, 'latest': 2025},
'niva': {'earliest': 1977, 'latest': 2025},
's60': {'earliest': 2000, 'latest': 2025},
'nubira': {'earliest': 1997, 'latest': 2008},
'vivaro': {'earliest': 2001, 'latest': 2025},
'g_klasse': {'earliest': 1979, 'latest': 2025},
'lodgy': {'earliest': 2012, 'latest': 2025},
'850': {'earliest': 1991, 'latest': 1997},
'range_rover': {'earliest': 1970, 'latest': 2025},
'q3': {'earliest': 2011, 'latest': 2025},
'serie_2': {'earliest': 1958, 'latest': 2025},
'glk': {'earliest': 2008, 'latest': 2015},
'charade': {'earliest': 1977, 'latest': 2000},
'croma': {'earliest': 1985, 'latest': 2010},
'outlander': {'earliest': 2001, 'latest': 2025},
'doblo': {'earliest': 2000, 'latest': 2025},
'musa': {'earliest': 2004, 'latest': 2012},
'move': {'earliest': 1998, 'latest': 2002},
'9000': {'earliest': 1985, 'latest': 1998},
'v60': {'earliest': 2010, 'latest': 2025},
'145': {'earliest': 1994, 'latest': 2000},
'aveo': {'earliest': 2002, 'latest': 2011},
'200': {'earliest': 1980, 'latest': 2007},
'b_max': {'earliest': 2007, 'latest': 2012},
'range_rover_sport': {'earliest': 2005, 'latest': 2025},
'terios': {'earliest': 1997, 'latest': 2017},
'rangerover': {'earliest': 1970, 'latest': 2025},
'q5': {'earliest': 2008, 'latest': 2025},
'range_rover_evoque':{'earliest': 2011, 'latest': 2025},
'materia': {'earliest': 2007, 'latest': 2012},
'delta': {'earliest': 1979, 'latest': 2014},
'gl': {'earliest': 2006, 'latest': 2015},
'kalina': {'earliest': 2004, 'latest': 2018},
'amarok': {'earliest': 2010, 'latest': 2025},
'elefantino': {'earliest': 1963, 'latest': 2011},
'i3': {'earliest': 2013, 'latest': 2025},
'kappa': {'earliest': 1994, 'latest': 2001},
'serie_3': {'earliest': 1975, 'latest': 2025},
'serie_1': {'earliest': 2004, 'latest': 2025}
}
for r in r_cols:
    earliest = r_years[r]['earliest']
    latest = r_years[r]['latest']
    # early
    e = (df['model'] == r) & (df['registrationyear'] < earliest)
    df.loc[e,['registration_correction']] = "Y: too early"
    # late
    l = (df['model'] == r) & (df['registrationyear'] > latest)
    df.loc[l,['registration_correction']] = "Y: too late"
    # acceptable
    a = (df['model'] == r) & (df['registrationyear'] >= earliest) & (df['registrationyear'] <= latest)
    df.loc[a,['registration_correction']] = "N"
df['registration_correction'].isna().sum()
44028
remainder_models = df[(df['registration_correction'].isna()) & (df['model'].notna())]
remainder_models['model'].value_counts()
other    24421
Name: model, dtype: int64
# Mark incorrect registration years as NaN
other_reg_less = (df['model'] == 'other') & (df['registrationyear'] < 1893)
other_reg_more = (df['model'] == 'other') & (df['registrationyear'] > 2025)
df.loc[other_reg_more,['registration_correction']] = "Y: too late (other)"
df.loc[other_reg_less,['registration_correction']] = "Y: too early (other)"
df.loc[other_reg_more,['registrationyear']] = np.nan
df.loc[other_reg_less,['registrationyear']] = np.nan
# Mark remaining incorrect registration years (any model) as NaN
incorrect = ((df['registrationyear'] < 1893) | (df['registrationyear'] > 2025))
df.loc[incorrect,['registrationyear']] = np.nan
del incorrect
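The two masking cells above can be read as a single range check. A minimal sketch on toy data, assuming the same limits of 1893 and 2025; the names `MIN_YEAR`, `MAX_YEAR`, and `years` are illustrative only:

```python
import pandas as pd

MIN_YEAR, MAX_YEAR = 1893, 2025  # same limits as the cells above

# Toy registration years, including two obviously invalid entries
years = pd.Series([1000, 1893, 1999, 2025, 9999], dtype="float64")

# between() is inclusive on both ends by default; where() turns the rest into NaN
cleaned = years.where(years.between(MIN_YEAR, MAX_YEAR))
```

Values already missing stay missing, because `between` evaluates to False for NaN.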
df[(df['registration_correction'].isna()) & (df['model'] == 'other')].value_counts(subset = 'brand')
brand_reg_cols = [
'mercedes_benz', 'citroen', 'fiat', 'ford', 'hyundai', 'peugeot', 'opel',
'suzuki', 'audi', 'mazda', 'renault', 'chevrolet', 'toyota', 'mitsubishi',
'volkswagen', 'nissan', 'volvo', 'alfa_romeo', 'kia', 'rover', 'chrysler',
'saab', 'honda', 'skoda', 'bmw', 'jaguar', 'porsche', 'jeep', 'seat',
'daihatsu', 'lancia', 'mini', 'daewoo', 'trabant', 'smart', 'subaru',
'lada', 'dacia', 'land_rover'
]
brand_registration_years = {
'mercedes_benz': {'earliest': 1926, 'latest': 2025},
'citroen': {'earliest': 1919, 'latest': 2025},
'fiat': {'earliest': 1899, 'latest': 2025},
'ford': {'earliest': 1903, 'latest': 2025},
'hyundai': {'earliest': 1967, 'latest': 2025},
'peugeot': {'earliest': 1889, 'latest': 2025},
'opel': {'earliest': 1899, 'latest': 2025},
'suzuki': {'earliest': 1955, 'latest': 2025},
'audi': {'earliest': 1910, 'latest': 2025},
'mazda': {'earliest': 1931, 'latest': 2025},
'renault': {'earliest': 1898, 'latest': 2025},
'chevrolet': {'earliest': 1911, 'latest': 2025},
'toyota': {'earliest': 1936, 'latest': 2025},
'mitsubishi': {'earliest': 1917, 'latest': 2025},
'volkswagen': {'earliest': 1937, 'latest': 2025},
'nissan': {'earliest': 1933, 'latest': 2025},
'volvo': {'earliest': 1927, 'latest': 2025},
'alfa_romeo': {'earliest': 1910, 'latest': 2025},
'kia': {'earliest': 1944, 'latest': 2025},
'rover': {'earliest': 1904, 'latest': 2005},
'chrysler': {'earliest': 1925, 'latest': 2025},
'saab': {'earliest': 1947, 'latest': 2011},
'honda': {'earliest': 1963, 'latest': 2025},
'skoda': {'earliest': 1905, 'latest': 2025},
'bmw': {'earliest': 1928, 'latest': 2025},
'jaguar': {'earliest': 1935, 'latest': 2025},
'porsche': {'earliest': 1948, 'latest': 2025},
'jeep': {'earliest': 1941, 'latest': 2025},
'seat': {'earliest': 1950, 'latest': 2025},
'daihatsu': {'earliest': 1951, 'latest': 2025},
'lancia': {'earliest': 1908, 'latest': 2025},
'mini': {'earliest': 1959, 'latest': 2025},
'daewoo': {'earliest': 1937, 'latest': 2011},
'trabant': {'earliest': 1957, 'latest': 1991},
'smart': {'earliest': 1998, 'latest': 2025},
'subaru': {'earliest': 1954, 'latest': 2025},
'lada': {'earliest': 1966, 'latest': 2025},
'dacia': {'earliest': 1966, 'latest': 2025},
'land_rover': {'earliest': 1948, 'latest': 2025}
}
for brand in brand_reg_cols:
    other = (df['model'] == 'other')
    reg_corr_na = (df['registration_correction'].isna())
    earliest = brand_registration_years[brand]['earliest']
    latest = brand_registration_years[brand]['latest']
    # too early
    te = (df['brand'] == brand) & other & reg_corr_na & (df['registrationyear'] < earliest)
    df.loc[te,['registration_correction']] = "Y: too early (other)"
    # too late
    tl = (df['brand'] == brand) & other & reg_corr_na & (df['registrationyear'] > latest)
    df.loc[tl,['registration_correction']] = "Y: too late (other)"
    # acceptable
    accept = (df['brand'] == brand) & other & reg_corr_na & (df['registrationyear'] >= earliest) \
        & (df['registrationyear'] <= latest)
    df.loc[accept,['registration_correction']] = "N"
df['registration_correction'].isna().sum()
19607
for brand in brand_reg_cols:
    reg_corr_na = (df['registration_correction'].isna())
    earliest = brand_registration_years[brand]['earliest']
    latest = brand_registration_years[brand]['latest']
    # too early
    ear = (df['brand'] == brand) & reg_corr_na & (df['registrationyear'] < earliest)
    df.loc[ear,['registration_correction']] = "Y: too early"
    # too late
    lat = (df['brand'] == brand) & reg_corr_na & (df['registrationyear'] > latest)
    df.loc[lat,['registration_correction']] = "Y: too late"
    # acceptable
    apt = (df['brand'] == brand) & reg_corr_na & (df['registrationyear'] >= earliest) \
        & (df['registrationyear'] <= latest)
    df.loc[apt,['registration_correction']] = "N"
df['registration_correction'].isna().sum()
3338
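The checks so far follow a precedence: model-level bounds where the model is known, brand-level bounds as the fallback for `other` or missing models. A minimal sketch of that fallback on toy rows; the `golf` bound is an illustrative value, the two brand bounds are copied from `brand_registration_years`, and all variable names are assumptions:

```python
import pandas as pd

model_earliest = {"golf": 1974}                        # illustrative model-level bound
brand_earliest = {"volkswagen": 1937, "smart": 1998}   # from the brand dictionary above

toy = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "smart"],
    "model": ["golf", "other", None],
    "registrationyear": [1960, 1960, 1990],
})

# Prefer the model-level bound; fall back to the brand-level bound when it is missing
earliest = toy["model"].map(model_earliest).fillna(toy["brand"].map(brand_earliest))
toy["too_early"] = toy["registrationyear"] < earliest
```

The same pattern would apply to the `latest` bound.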
df[(df['registration_correction'].isna()) & ((df['registrationyear'] < 1893) | (df['registrationyear'] > 2025))]
# Mark the remaining NaN values in registration_correction as "N"
df.loc[(df['registration_correction'].isna()), ['registration_correction']] = "N"
df['registration_correction'].value_counts()
other_l = df['registration_correction'] == 'Y: too late (other)'
other_e = df['registration_correction'] == 'Y: too early (other)'
df.loc[other_l,['registration_correction']] = "Y: too late"
df.loc[other_e,['registration_correction']] = "Y: too early"
df['registration_correction'].value_counts()
N               349367
Y: too late       2950
Y: too early      1955
Y: missing          97
Name: registration_correction, dtype: int64
del other_l
del other_e
display(df[(df['registration_correction'] == 'Y: missing') & (df['brand'] != 'sonstige_autos')])
#pd.merge(df[(df['registration_correction'] == 'Y: missing') & (df['brand'] != 'sonstige_autos')],on = 'index', how = 'left')
index = car_dates[car_dates['registrationyear'].notna()].index
display(index)
values = car_dates.loc[index, 'registrationyear'].values
display(values)
print("")
print("")
print("")
dictionary = dict(zip(index,values))
df.loc[index, ['registrationyear']] = values
display(df.loc[index])
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 622 | 16/03/2016 16:55 | 0 | NaN | NaN | NaN | 0 | NaN | 5000 | 0 | NaN | opel | NaN | 16/03/2016 00:00 | 0 | 44628 | 20/03/2016 16:44 | Y: missing |
| 18023 | 24/03/2016 08:57 | 1 | NaN | NaN | NaN | 0 | NaN | 5000 | 0 | NaN | volkswagen | NaN | 24/03/2016 00:00 | 0 | 50829 | 06/04/2016 22:45 | Y: missing |
| 24458 | 29/03/2016 19:50 | 50 | small | NaN | manual | 0 | NaN | 5000 | 1 | NaN | volkswagen | yes | 29/03/2016 00:00 | 0 | 91487 | 06/04/2016 05:46 | Y: missing |
| 32768 | 11/03/2016 17:53 | 1500 | small | NaN | manual | 75 | NaN | 100000 | 4 | petrol | smart | NaN | 11/03/2016 00:00 | 0 | 47055 | 05/04/2016 18:45 | Y: missing |
| 34332 | 01/04/2016 06:02 | 450 | NaN | NaN | NaN | 1800 | NaN | 5000 | 2 | NaN | mitsubishi | no | 01/04/2016 00:00 | 0 | 63322 | 01/04/2016 09:42 | Y: missing |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 323443 | 26/03/2016 20:58 | 30 | NaN | NaN | NaN | 0 | NaN | 5000 | 0 | NaN | audi | NaN | 26/03/2016 00:00 | 0 | 37574 | 06/04/2016 12:17 | Y: missing |
| 325739 | 30/03/2016 11:36 | 400 | NaN | NaN | NaN | 0 | NaN | 5000 | 0 | NaN | mercedes_benz | NaN | 30/03/2016 00:00 | 0 | 8060 | 01/04/2016 06:16 | Y: missing |
| 333004 | 20/03/2016 14:57 | 0 | suv | NaN | manual | 0 | NaN | 5000 | 0 | NaN | toyota | NaN | 20/03/2016 00:00 | 0 | 48683 | 20/03/2016 14:57 | Y: missing |
| 333488 | 23/03/2016 01:36 | 0 | NaN | NaN | NaN | 0 | NaN | 10000 | 0 | NaN | bmw | NaN | 23/03/2016 00:00 | 0 | 32689 | 23/03/2016 08:47 | Y: missing |
| 343083 | 01/04/2016 08:51 | 1 | other | NaN | NaN | 0 | NaN | 5000 | 1 | other | volkswagen | NaN | 01/04/2016 00:00 | 0 | 18273 | 07/04/2016 05:44 | Y: missing |
61 rows × 17 columns
Int64Index([ 622, 15147, 15870, 17346, 20159, 34332, 38875, 41170,
46935, 55605, 60017, 60079, 66198, 70847, 78128, 84841,
87522, 91869, 94926, 110123, 112768, 118047, 122692, 128677,
129221, 129980, 130474, 135865, 139360, 139756, 146323, 146507,
148570, 149151, 151228, 151725, 158283, 167937, 172242, 174531,
177353, 183779, 184598, 200525, 202258, 206219, 214830, 215678,
220638, 221736, 224832, 226526, 230741, 233631, 234896, 242233,
243656, 244092, 244254, 248137, 252476, 255866, 260401, 268091,
272024, 278390, 278517, 290609, 295172, 316487, 323443, 325739,
333488, 340548, 348830],
dtype='int64')
array([1111, 9999, 3200, 8888, 4100, 1800, 1234, 5300, 6000, 1000, 1000,
9999, 1000, 1255, 1000, 3800, 4800, 1000, 7000, 1000, 1000, 6000,
2500, 9999, 1000, 1000, 9450, 1000, 1800, 2500, 1234, 5000, 1688,
9999, 9999, 1000, 6000, 9999, 2800, 1253, 9999, 1000, 9999, 9999,
9000, 5600, 1600, 1111, 2222, 1039, 9999, 3000, 1000, 1000, 9996,
1000, 1000, 1000, 3000, 6000, 2222, 2800, 8455, 9999, 5000, 4500,
1500, 1500, 9229, 5000, 1000, 1000, 9999, 3500, 1000])
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 622 | 16/03/2016 16:55 | 0 | NaN | 1111.0 | NaN | 0 | NaN | 5000 | 0 | NaN | opel | NaN | 16/03/2016 00:00 | 0 | 44628 | 20/03/2016 16:44 | Y: missing |
| 15147 | 14/03/2016 00:52 | 0 | NaN | 9999.0 | NaN | 0 | NaN | 10000 | 0 | NaN | sonstige_autos | NaN | 13/03/2016 00:00 | 0 | 32689 | 21/03/2016 23:46 | Y: missing |
| 15870 | 02/04/2016 11:55 | 1700 | NaN | 3200.0 | NaN | 0 | NaN | 5000 | 0 | NaN | sonstige_autos | NaN | 02/04/2016 00:00 | 0 | 33649 | 06/04/2016 09:46 | Y: missing |
| 17346 | 06/03/2016 16:06 | 6500 | NaN | 8888.0 | NaN | 0 | NaN | 10000 | 0 | NaN | sonstige_autos | NaN | 06/03/2016 00:00 | 0 | 55262 | 30/03/2016 20:46 | Y: missing |
| 20159 | 01/04/2016 21:57 | 1600 | NaN | 4100.0 | NaN | 0 | NaN | 5000 | 0 | NaN | sonstige_autos | NaN | 01/04/2016 00:00 | 0 | 67686 | 05/04/2016 20:19 | Y: missing |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 323443 | 26/03/2016 20:58 | 30 | NaN | 1000.0 | NaN | 0 | NaN | 5000 | 0 | NaN | audi | NaN | 26/03/2016 00:00 | 0 | 37574 | 06/04/2016 12:17 | Y: missing |
| 325739 | 30/03/2016 11:36 | 400 | NaN | 1000.0 | NaN | 0 | NaN | 5000 | 0 | NaN | mercedes_benz | NaN | 30/03/2016 00:00 | 0 | 8060 | 01/04/2016 06:16 | Y: missing |
| 333488 | 23/03/2016 01:36 | 0 | NaN | 9999.0 | NaN | 0 | NaN | 10000 | 0 | NaN | bmw | NaN | 23/03/2016 00:00 | 0 | 32689 | 23/03/2016 08:47 | Y: missing |
| 340548 | 02/04/2016 17:44 | 0 | NaN | 3500.0 | manual | 75 | NaN | 5000 | 3 | petrol | sonstige_autos | NaN | 02/04/2016 00:00 | 0 | 96465 | 04/04/2016 15:17 | Y: missing |
| 348830 | 22/03/2016 00:38 | 1 | NaN | 1000.0 | NaN | 1000 | NaN | 150000 | 0 | NaN | sonstige_autos | NaN | 21/03/2016 00:00 | 0 | 41472 | 05/04/2016 14:18 | Y: missing |
75 rows × 17 columns
ind97 = df.loc[[32768, 261138, 320335]]
ind95 = df.loc[[148233]]
ind96 = df.loc[[176475]]
df.loc[ind97.index,['registrationyear']] = 1997
df.loc[ind95.index,['registrationyear']] = 1995
df.loc[ind96.index,['registrationyear']] = 1996
df.loc[[32768, 261138, 320335, 148233, 176475]]
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 32768 | 11/03/2016 17:53 | 1500 | small | 1997.0 | manual | 75 | NaN | 100000 | 4 | petrol | smart | NaN | 11/03/2016 00:00 | 0 | 47055 | 05/04/2016 18:45 | Y: missing |
| 261138 | 28/03/2016 11:56 | 500 | wagon | 1997.0 | manual | 90 | NaN | 150000 | 12 | gasoline | smart | yes | 28/03/2016 00:00 | 0 | 99310 | 06/04/2016 15:15 | Y: missing |
| 320335 | 15/03/2016 19:58 | 850 | wagon | 1997.0 | auto | 170 | NaN | 150000 | 0 | NaN | smart | no | 15/03/2016 00:00 | 0 | 4205 | 16/03/2016 17:51 | Y: missing |
| 148233 | 02/04/2016 21:57 | 1000 | small | 1995.0 | manual | 60 | NaN | 150000 | 0 | petrol | smart | no | 02/04/2016 00:00 | 0 | 6667 | 06/04/2016 22:46 | Y: missing |
| 176475 | 07/03/2016 09:52 | 1000 | NaN | 1996.0 | auto | 0 | NaN | 150000 | 0 | NaN | smart | NaN | 07/03/2016 00:00 | 0 | 3222 | 08/03/2016 03:46 | Y: missing |
del ind97
del ind95
del ind96
display(df['registrationyear'].isna().sum())
index_rc = df[(df['registration_correction'] == "Y: missing") & (df['registrationyear'].isna())].index
display(index_rc)
index_replace = df1.loc[index_rc]
display(index_replace)
df.loc[index_rc,['registrationyear']] = 1910
113
Int64Index([ 18023, 24458, 64345, 69320, 90011, 150021, 154571, 155833,
166750, 188748, 190238, 212091, 225151, 273431, 321782, 333004,
343083],
dtype='int64')
| | DateCrawled | Price | VehicleType | RegistrationYear | Gearbox | Power | Model | Mileage | RegistrationMonth | FuelType | Brand | NotRepaired | DateCreated | NumberOfPictures | PostalCode | LastSeen |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18023 | 24/03/2016 08:57 | 1 | NaN | 1910 | NaN | 0 | NaN | 5000 | 0 | NaN | volkswagen | NaN | 24/03/2016 00:00 | 0 | 50829 | 06/04/2016 22:45 |
| 24458 | 29/03/2016 19:50 | 50 | small | 1910 | manual | 0 | NaN | 5000 | 1 | NaN | volkswagen | yes | 29/03/2016 00:00 | 0 | 91487 | 06/04/2016 05:46 |
| 64345 | 11/03/2016 09:37 | 160 | NaN | 1910 | NaN | 0 | NaN | 5000 | 0 | NaN | hyundai | NaN | 11/03/2016 00:00 | 0 | 52525 | 24/03/2016 10:15 |
| 69320 | 11/03/2016 22:53 | 20 | NaN | 1910 | NaN | 0 | NaN | 5000 | 0 | NaN | trabant | NaN | 11/03/2016 00:00 | 0 | 6618 | 25/03/2016 16:16 |
| 90011 | 03/04/2016 09:02 | 5000 | NaN | 1910 | NaN | 0 | NaN | 150000 | 0 | petrol | bmw | NaN | 03/04/2016 00:00 | 0 | 21079 | 07/04/2016 10:45 |
| 150021 | 11/03/2016 22:56 | 20 | NaN | 1910 | NaN | 0 | NaN | 5000 | 0 | NaN | trabant | NaN | 11/03/2016 00:00 | 0 | 6618 | 26/03/2016 06:46 |
| 154571 | 24/03/2016 09:57 | 0 | NaN | 1910 | NaN | 0 | NaN | 5000 | 0 | NaN | jeep | NaN | 24/03/2016 00:00 | 0 | 24622 | 27/03/2016 05:46 |
| 155833 | 11/03/2016 22:37 | 15 | NaN | 1910 | NaN | 0 | NaN | 5000 | 0 | NaN | trabant | NaN | 11/03/2016 00:00 | 0 | 90491 | 25/03/2016 11:18 |
| 166750 | 17/03/2016 19:40 | 99 | NaN | 1910 | NaN | 0 | NaN | 150000 | 0 | NaN | subaru | yes | 17/03/2016 00:00 | 0 | 21635 | 17/03/2016 19:40 |
| 188748 | 24/03/2016 13:46 | 0 | NaN | 1910 | NaN | 0 | NaN | 5000 | 0 | NaN | bmw | NaN | 24/03/2016 00:00 | 0 | 1279 | 07/04/2016 05:16 |
| 190238 | 11/03/2016 23:49 | 15 | NaN | 1910 | NaN | 0 | NaN | 5000 | 0 | NaN | trabant | NaN | 11/03/2016 00:00 | 0 | 6618 | 26/03/2016 06:46 |
| 212091 | 02/04/2016 21:48 | 200 | NaN | 1910 | NaN | 0 | NaN | 5000 | 0 | NaN | trabant | NaN | 02/04/2016 00:00 | 0 | 2627 | 06/04/2016 22:44 |
| 225151 | 09/03/2016 17:48 | 0 | NaN | 1910 | NaN | 0 | NaN | 150000 | 0 | NaN | trabant | NaN | 09/03/2016 00:00 | 0 | 26676 | 09/03/2016 17:48 |
| 273431 | 09/03/2016 13:50 | 2500 | NaN | 1910 | NaN | 0 | NaN | 5000 | 0 | NaN | volkswagen | NaN | 09/03/2016 00:00 | 0 | 59320 | 15/03/2016 14:46 |
| 321782 | 25/03/2016 18:50 | 0 | small | 1910 | manual | 600 | NaN | 150000 | 5 | NaN | volkswagen | yes | 25/03/2016 00:00 | 0 | 35764 | 25/03/2016 21:27 |
| 333004 | 20/03/2016 14:57 | 0 | suv | 1910 | manual | 0 | NaN | 5000 | 0 | NaN | toyota | NaN | 20/03/2016 00:00 | 0 | 48683 | 20/03/2016 14:57 |
| 343083 | 01/04/2016 08:51 | 1 | other | 1910 | NaN | 0 | NaN | 5000 | 1 | other | volkswagen | NaN | 01/04/2016 00:00 | 0 | 18273 | 07/04/2016 05:44 |
del index_rc
gc.collect()
0
display(df[df['registration_correction'] == "Y: missing"])
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 622 | 16/03/2016 16:55 | 0 | NaN | 1111.0 | NaN | 0 | NaN | 5000 | 0 | NaN | opel | NaN | 16/03/2016 00:00 | 0 | 44628 | 20/03/2016 16:44 | Y: missing |
| 15147 | 14/03/2016 00:52 | 0 | NaN | 9999.0 | NaN | 0 | NaN | 10000 | 0 | NaN | sonstige_autos | NaN | 13/03/2016 00:00 | 0 | 32689 | 21/03/2016 23:46 | Y: missing |
| 15870 | 02/04/2016 11:55 | 1700 | NaN | 3200.0 | NaN | 0 | NaN | 5000 | 0 | NaN | sonstige_autos | NaN | 02/04/2016 00:00 | 0 | 33649 | 06/04/2016 09:46 | Y: missing |
| 17346 | 06/03/2016 16:06 | 6500 | NaN | 8888.0 | NaN | 0 | NaN | 10000 | 0 | NaN | sonstige_autos | NaN | 06/03/2016 00:00 | 0 | 55262 | 30/03/2016 20:46 | Y: missing |
| 18023 | 24/03/2016 08:57 | 1 | NaN | 1910.0 | NaN | 0 | NaN | 5000 | 0 | NaN | volkswagen | NaN | 24/03/2016 00:00 | 0 | 50829 | 06/04/2016 22:45 | Y: missing |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 333004 | 20/03/2016 14:57 | 0 | suv | 1910.0 | manual | 0 | NaN | 5000 | 0 | NaN | toyota | NaN | 20/03/2016 00:00 | 0 | 48683 | 20/03/2016 14:57 | Y: missing |
| 333488 | 23/03/2016 01:36 | 0 | NaN | 9999.0 | NaN | 0 | NaN | 10000 | 0 | NaN | bmw | NaN | 23/03/2016 00:00 | 0 | 32689 | 23/03/2016 08:47 | Y: missing |
| 340548 | 02/04/2016 17:44 | 0 | NaN | 3500.0 | manual | 75 | NaN | 5000 | 3 | petrol | sonstige_autos | NaN | 02/04/2016 00:00 | 0 | 96465 | 04/04/2016 15:17 | Y: missing |
| 343083 | 01/04/2016 08:51 | 1 | other | 1910.0 | NaN | 0 | NaN | 5000 | 1 | other | volkswagen | NaN | 01/04/2016 00:00 | 0 | 18273 | 07/04/2016 05:44 | Y: missing |
| 348830 | 22/03/2016 00:38 | 1 | NaN | 1000.0 | NaN | 1000 | NaN | 150000 | 0 | NaN | sonstige_autos | NaN | 21/03/2016 00:00 | 0 | 41472 | 05/04/2016 14:18 | Y: missing |
97 rows × 17 columns
for brand in brand_reg_cols:
    y_miss = (df['registration_correction'] == 'Y: missing')
    earliest = brand_registration_years[brand]['earliest']
    latest = brand_registration_years[brand]['latest']
    # too early
    earl = (df['brand'] == brand) & y_miss & (df['registrationyear'] < earliest)
    df.loc[earl,['registration_correction']] = "Y: too early"
    # too late
    late = (df['brand'] == brand) & y_miss & (df['registrationyear'] > latest)
    df.loc[late,['registration_correction']] = "Y: too late"
    # acceptable
    ap = (df['brand'] == brand) & y_miss & (df['registrationyear'] >= earliest) \
        & (df['registrationyear'] <= latest)
    df.loc[ap,['registration_correction']] = "N"
y_less = (df['registration_correction'] == "Y: missing") & (df['registrationyear'] < 1893)
y_more = (df['registration_correction'] == "Y: missing") & (df['registrationyear'] > 2025)
# Inclusive bounds, so the boundary years 1893 and 2025 are not left as "Y: missing"
acceptable = (df['registration_correction'] == "Y: missing") & (df['registrationyear'] >= 1893) \
    & (df['registrationyear'] <= 2025)
df.loc[y_less, ['registration_correction']] = "Y: too early"
df.loc[y_more, ['registration_correction']] = "Y: too late"
df.loc[acceptable,['registration_correction']] = "N"
del y_less
del y_more
del acceptable
gc.collect()
0
display(df['registration_correction'].value_counts())
N               349367
Y: too late       2992
Y: too early      2010
Name: registration_correction, dtype: int64
index = df[df['registrationyear'].isna()].index
values = df1['RegistrationYear'].loc[index]
display(values)
df.loc[index,['registrationyear']] = values
df.loc[index]
12946 5000
16062 1000
17271 9999
18224 1500
18259 2200
...
335727 7500
338829 3000
340759 1600
341791 3000
351682 1800
Name: RegistrationYear, Length: 96, dtype: int64
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12946 | 29/03/2016 18:39 | 49 | NaN | 5000.0 | NaN | 0 | golf | 5000 | 12 | NaN | volkswagen | NaN | 29/03/2016 00:00 | 0 | 74523 | 06/04/2016 04:16 | Y: too late |
| 16062 | 29/03/2016 23:42 | 190 | NaN | 1000.0 | NaN | 0 | mondeo | 5000 | 0 | NaN | ford | NaN | 29/03/2016 00:00 | 0 | 47166 | 06/04/2016 10:44 | Y: too early |
| 17271 | 23/03/2016 16:43 | 700 | NaN | 9999.0 | NaN | 0 | other | 10000 | 0 | NaN | opel | NaN | 23/03/2016 00:00 | 0 | 21769 | 05/04/2016 20:16 | Y: too late |
| 18224 | 09/03/2016 17:49 | 7999 | NaN | 1500.0 | manual | 224 | impreza | 5000 | 3 | NaN | subaru | NaN | 09/03/2016 00:00 | 0 | 53577 | 15/03/2016 05:15 | Y: too early |
| 18259 | 16/03/2016 20:37 | 300 | NaN | 2200.0 | NaN | 0 | twingo | 5000 | 12 | NaN | renault | NaN | 16/03/2016 00:00 | 0 | 45307 | 07/04/2016 06:45 | Y: too late |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 335727 | 09/03/2016 07:01 | 0 | NaN | 7500.0 | manual | 0 | other | 10000 | 0 | petrol | mini | no | 09/03/2016 00:00 | 0 | 9669 | 19/03/2016 19:44 | Y: too late |
| 338829 | 24/03/2016 19:49 | 50 | NaN | 3000.0 | NaN | 3000 | golf | 100000 | 6 | NaN | volkswagen | yes | 24/03/2016 00:00 | 0 | 23992 | 03/04/2016 13:17 | Y: too late |
| 340759 | 04/04/2016 23:55 | 700 | NaN | 1600.0 | manual | 1600 | a3 | 150000 | 4 | petrol | audi | no | 04/04/2016 00:00 | 0 | 86343 | 05/04/2016 06:44 | Y: too early |
| 341791 | 28/03/2016 17:37 | 1 | NaN | 3000.0 | NaN | 0 | zafira | 5000 | 0 | NaN | opel | NaN | 28/03/2016 00:00 | 0 | 26624 | 02/04/2016 22:17 | Y: too late |
| 351682 | 12/03/2016 00:57 | 11500 | NaN | 1800.0 | NaN | 16 | other | 5000 | 6 | petrol | fiat | NaN | 11/03/2016 00:00 | 0 | 16515 | 05/04/2016 19:47 | Y: too early |
96 rows × 17 columns
Duplicate Handling¶
display(df.duplicated().sum())
df[df.duplicated()]
df = df.drop_duplicates()
df.duplicated().sum()
262
0
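Before dropping duplicates it can help to look at every copy of a duplicated row, not only the later ones. A minimal sketch on a toy frame (the rows and variable names are illustrative, not taken from the dataset):

```python
import pandas as pd

toy = pd.DataFrame({
    "brand": ["audi", "audi", "bmw"],
    "price": [100, 100, 200],
})

# duplicated() flags only the second and later copies of a row
n_dupes = toy.duplicated().sum()

# keep=False flags every member of a duplicate group, useful for inspection
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first copy; reset_index is optional
deduped = toy.drop_duplicates().reset_index(drop=True)
```

The notebook's cell keeps the original index after dropping, which is why `df.info()` below still reports labels up to 354368.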
Missing Values¶
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 354107 entries, 0 to 354368
Data columns (total 17 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   datecrawled              354107 non-null  object
 1   price                    354107 non-null  int64
 2   vehicletype              316623 non-null  object
 3   registrationyear         354107 non-null  float64
 4   gearbox                  334277 non-null  object
 5   power                    354107 non-null  int64
 6   model                    334407 non-null  object
 7   mileage                  354107 non-null  int64
 8   registrationmonth        354107 non-null  int64
 9   fueltype                 321218 non-null  object
 10  brand                    354107 non-null  object
 11  notrepaired              282962 non-null  object
 12  datecreated              354107 non-null  object
 13  numberofpictures         354107 non-null  int64
 14  postalcode               354107 non-null  int64
 15  lastseen                 354107 non-null  object
 16  registration_correction  354107 non-null  object
dtypes: float64(1), int64(6), object(10)
memory usage: 48.6+ MB
Missing Values:
| Column | Percent Missing |
|---|---|
| Vehicle Type | 10.586 % |
| GearBox | 5.600 % |
| Model | 5.563 % |
| FuelType | 9.288 % |
| NotRepaired | 20.091 % |
# Percent Missing (non-null counts taken from df.info() above)
print("Percent Missing")
print("===============")
vt = 354107 - 316623
vtp = (vt/354107) * 100
print(f"Vehicle Type: \n{vtp:.3f} %")
print("")
gb = 354107 - 334277
gbp = (gb/354107) * 100
print(f"GearBox: \n{gbp:.3f} %")
print("")
m = 354107 - 334407  # df.info() reports 334407 non-null values for model
mp = (m/354107) * 100
print(f"Model: \n{mp:.3f} %")
print("")
ft = 354107 - 321218
ftp = (ft/354107) * 100
print(f"FuelType: \n{ftp:.3f} %")
print("")
nr = 354107 - 282962
nrp = (nr/354107) * 100
print(f"NotRepaired: \n{nrp:.3f} %")
Percent Missing
===============
Vehicle Type: 
10.586 %

GearBox: 
5.600 %

Model: 
5.563 %

FuelType: 
9.288 %

NotRepaired: 
20.091 %
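Hard-coded counts go stale as soon as the frame changes. A minimal sketch that derives the same percentages directly from the data, shown on a toy frame (the rows and the name `pct_missing` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "vehicletype": ["small", None, "suv", None],
    "gearbox": ["manual", "auto", None, "manual"],
    "brand": ["audi", "audi", "bmw", "opel"],
})

# isna().mean() is the fraction of missing values per column
pct_missing = toy.isna().mean().mul(100).round(3)

# Keep only columns that have gaps, largest share first
pct_missing = pct_missing[pct_missing > 0].sort_values(ascending=False)
```

Applied to `df`, the same two lines would replace the manual subtraction for each column.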
# Inspect Model Column
model_col = df[(df['model'].isna()) & (df['brand'] != 'sonstige_autos')]
model_col_p0 = model_col[model_col['power'] == 0]
brand_p0 = model_col_p0['brand'].value_counts()
brand_p0.plot(kind='bar', x='brand', y='power', figsize=(12,6))
plt.title('0hp powered vehicles by brand (model info. missing)')
plt.xlabel('Brand')
plt.ylabel('0hp Power Frequency')
plt.xticks(rotation=90)
plt.tight_layout()
plt.show()
display(brand_p0)
display(model_col_p0['brand'].unique())
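The bar-chart pattern in this cell recurs for each brand tier further down. A minimal sketch of a reusable helper, assuming a hypothetical name `plot_zero_power_counts` and toy counts; the `Agg` backend is set only so the sketch runs without a display:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; not needed inside a notebook
import matplotlib.pyplot as plt
import pandas as pd

def plot_zero_power_counts(counts, title, xlabel="Brand"):
    """Draw a bar chart from a value_counts-style Series and return the Axes."""
    fig, ax = plt.subplots(figsize=(12, 6))
    counts.plot(kind="bar", ax=ax)
    ax.set_title(title)
    ax.set_xlabel(xlabel)
    ax.set_ylabel("0hp Power Frequency")
    ax.tick_params(axis="x", labelrotation=90)
    fig.tight_layout()
    return ax

# Toy counts (illustrative numbers, not from the dataset)
toy_counts = pd.Series({"volkswagen": 3, "opel": 2, "bmw": 1})
ax = plot_zero_power_counts(toy_counts, "0hp powered vehicles by brand (toy data)")
```

In the notebook, `plt.show()` would follow each call.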
brand_p0_rows_bottom = ['mitsubishi', 'skoda', 'chevrolet', 'kia', 'porsche', 'chrysler', 'volvo', 'rover', 'daihatsu',
'daewoo', 'subaru', 'mini', 'lada', 'dacia', 'jeep', 'jaguar', 'lancia', 'saab', 'land_rover']
brand_p0_rows_mbottom = ['citroen', 'seat', 'hyundai', 'nissan', 'trabant', 'suzuki', 'toyota',
'alfa_romeo', 'honda']
brand_p0_rows_middle = ['ford', 'audi', 'peugeot', 'renault', 'fiat', 'mazda', 'smart']
brand_p0_rows_top = ['bmw', 'opel', 'mercedes_benz']
brand_p0_rows_vw = ['volkswagen']
# Separate by brand and known model
model_col_notna_top_bottom = df[(df['model'].notna()) & (df['brand'] != 'sonstige_autos') & (df['power'] == 0) \
& (df['brand'].isin(brand_p0_rows_bottom))]
model_col_notna_top_mbottom = df[(df['model'].notna()) & (df['brand'] != 'sonstige_autos') & (df['power'] == 0) \
& (df['brand'].isin(brand_p0_rows_mbottom))]
model_col_notna_top_middle = df[(df['model'].notna()) & (df['brand'] != 'sonstige_autos') & (df['power'] == 0) \
& (df['brand'].isin(brand_p0_rows_middle))]
model_col_notna_top = df[(df['model'].notna()) & (df['brand'] != 'sonstige_autos') & (df['power'] == 0) \
& (df['brand'].isin(brand_p0_rows_top))]
model_col_notna_top_vw = df[(df['model'].notna()) & (df['brand'] != 'sonstige_autos') & (df['power'] == 0) \
& (df['brand'].isin(brand_p0_rows_vw))]
# Use known model to find NaN values
# Least missing
brand_p0_notna_top_bottom = model_col_notna_top_bottom[['brand','model']].value_counts().sort_index()
# Middle Least missing
brand_p0_notna_top_mbottom = model_col_notna_top_mbottom[['brand','model']].value_counts().sort_index()
# Middle missing
brand_p0_notna_top_middle = model_col_notna_top_middle[['brand','model']].value_counts().sort_index()
# Top Missing
brand_p0_notna_top = model_col_notna_top[['brand','model']].value_counts().sort_index()
# Volkswagen
brand_p0_notna_top_vw = model_col_notna_top_vw[['brand','model']].value_counts()
# Least Missing
with pd.option_context('display.max_rows', None):
    display(brand_p0_notna_top_bottom)
brand_p0_notna_top_bottom.plot(kind = 'bar', x = ('brand','model'), y = 'power', figsize = (12,6))
plt.title('0hp powered vehicles by brand: Least Missing/ Bottom Tier (model info available)')
plt.xlabel("Brand, Model")
plt.ylabel("0hp Power Frequency")
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()
# Next Least Missing
display(brand_p0_notna_top_mbottom)
brand_p0_notna_top_mbottom.plot(kind = 'bar', x = ('brand','model'), y = 'power', figsize = (12,6))
plt.title('0hp powered vehicles by brand: Least Missing Ext/ Bottom Tier 2 (model info available)')
plt.xlabel("Brand, Model")
plt.ylabel("0hp Power Frequency")
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()
# Middle Missing
with pd.option_context('display.max_rows', None):
    display(brand_p0_notna_top_middle)
brand_p0_notna_top_middle.plot(kind = 'bar', x = ('brand','model'), y = 'power', figsize = (12,6))
plt.title('0hp powered vehicles by brand: Middle Missing (model info available)')
plt.xlabel("Brand, Model")
plt.ylabel("0hp Power Frequency")
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()
# Top Missing
display(brand_p0_notna_top)
brand_p0_notna_top.plot(kind = 'bar', x = ('brand','model'), y = 'power', figsize = (12,6))
plt.title('0hp powered vehicles by brand: Top Missing (model info available)')
plt.xlabel("Brand, Model")
plt.ylabel("0hp Power Frequency")
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()
# Volkswagen
display(brand_p0_notna_top_vw)
brand_p0_notna_top_vw.plot(kind = 'bar', x = ('brand','model'), y = 'power', figsize = (12,6))
plt.title('0hp powered vehicles by brand: Volkswagen (model info available)')
plt.xlabel("Brand, Model")
plt.ylabel("0hp Power Frequency")
plt.xticks(rotation = 90)
plt.tight_layout()
plt.show()
volkswagen       990
bmw              536
opel             512
mercedes_benz    429
ford             341
audi             325
peugeot          272
renault          254
fiat             182
mazda            114
smart            100
citroen           78
seat              73
hyundai           68
nissan            63
trabant           58
suzuki            50
toyota            47
alfa_romeo        42
honda             40
mitsubishi        38
skoda             38
chevrolet         34
kia               27
porsche           25
chrysler          25
volvo             25
rover             23
daihatsu          19
daewoo            16
subaru            12
mini              12
dacia              7
lada               7
jaguar             6
jeep               6
lancia             5
saab               3
land_rover         2
Name: brand, dtype: int64
array(['volkswagen', 'renault', 'mitsubishi', 'bmw', 'peugeot', 'audi',
'volvo', 'chevrolet', 'trabant', 'opel', 'smart', 'nissan',
'suzuki', 'mercedes_benz', 'mazda', 'seat', 'fiat', 'citroen',
'ford', 'skoda', 'kia', 'chrysler', 'daewoo', 'alfa_romeo',
'rover', 'porsche', 'dacia', 'honda', 'lada', 'subaru', 'hyundai',
'toyota', 'mini', 'jaguar', 'daihatsu', 'saab', 'land_rover',
'lancia', 'jeep'], dtype=object)
brand model
chevrolet aveo 8
captiva 8
matiz 41
other 129
spark 5
chrysler 300c 17
crossfire 6
grand 8
other 45
ptcruiser 34
voyager 65
dacia duster 13
lodgy 5
logan 33
other 1
sandero 12
daewoo kalos 16
lanos 18
matiz 39
nubira 7
other 12
daihatsu charade 6
cuore 77
materia 1
move 14
other 15
sirion 13
terios 3
jaguar other 17
s_type 10
x_type 24
jeep cherokee 29
grand 17
other 12
wrangler 12
kia carnival 52
ceed 9
other 63
picanto 29
rio 33
sorento 30
sportage 17
lada kalina 4
niva 26
other 17
samara 8
lancia delta 4
elefantino 1
kappa 2
lybra 11
musa 2
other 11
ypsilon 21
land_rover defender 13
discovery 9
freelander 27
other 4
range_rover 7
range_rover_sport 2
serie_1 2
serie_2 2
serie_3 1
mini clubman 6
cooper 63
one 27
other 14
mitsubishi carisma 56
colt 83
galant 32
lancer 35
other 92
outlander 10
pajero 21
porsche 911 28
boxster 15
cayenne 8
other 37
rover freelander 1
other 57
rangerover 1
saab 900 15
9000 2
other 12
skoda citigo 2
fabia 142
octavia 141
other 51
roomster 11
superb 10
yeti 1
subaru forester 9
impreza 21
justy 18
legacy 14
other 4
volvo 850 25
c_reihe 6
other 55
s60 1
v40 97
v50 8
v60 2
v70 47
xc_reihe 7
dtype: int64
brand model
alfa_romeo 145 11
147 39
156 58
159 10
other 39
spider 23
citroen berlingo 94
c1 32
c2 37
c3 48
c4 27
c5 41
other 285
honda accord 30
civic 130
cr_reihe 14
jazz 19
other 38
hyundai getz 60
i_reihe 56
other 140
santa 18
tucson 11
nissan almera 73
juke 4
micra 301
navara 14
note 7
other 79
primera 86
qashqai 26
x_trail 20
seat alhambra 27
altea 15
arosa 127
cordoba 60
ibiza 206
leon 33
other 31
toledo 43
suzuki grand 9
jimny 19
other 124
swift 69
toyota auris 15
avensis 35
aygo 41
corolla 82
other 94
rav 26
verso 16
yaris 90
trabant 601 165
other 38
dtype: int64
brand model
audi 100 33
200 2
80 212
90 15
a1 12
a2 39
a3 490
a4 729
a5 17
a6 363
a8 44
other 46
q3 1
q5 2
q7 17
tt 32
fiat 500 57
bravo 38
croma 7
doblo 32
ducato 98
other 244
panda 86
punto 518
seicento 129
stilo 71
ford b_max 1
c_max 36
escort 143
fiesta 726
focus 537
fusion 22
galaxy 148
ka 536
kuga 11
mondeo 433
mustang 32
other 203
s_max 5
transit 105
mazda 1_reihe 17
3_reihe 137
5_reihe 21
6_reihe 124
cx_reihe 4
mx_reihe 62
other 146
rx_reihe 12
peugeot 1_reihe 132
2_reihe 322
3_reihe 208
4_reihe 56
5_reihe 7
other 184
renault clio 523
espace 98
kangoo 181
laguna 181
megane 399
modus 38
other 104
r19 22
scenic 216
twingo 955
smart forfour 30
fortwo 423
other 24
roadster 12
dtype: int64
brand model
bmw 1er 123
3er 1534
5er 499
6er 13
7er 79
i3 3
m_reihe 17
other 85
x_reihe 125
z_reihe 22
mercedes_benz a_klasse 539
b_klasse 54
c_klasse 744
cl 21
clk 138
e_klasse 638
g_klasse 14
gl 1
glk 4
m_klasse 71
other 326
s_klasse 114
sl 49
slk 76
sprinter 132
v_klasse 24
viano 31
vito 127
opel agila 70
antara 13
astra 1111
calibra 22
combo 57
corsa 1770
insignia 18
kadett 74
meriva 61
omega 163
other 152
signum 29
tigra 72
vectra 519
vivaro 28
zafira 353
dtype: int64
brand model
volkswagen golf 2460
polo 1611
passat 883
transporter 496
touran 361
lupo 332
sharan 230
caddy 190
beetle 166
other 137
fox 75
bora 62
touareg 58
jetta 47
scirocco 27
tiguan 24
phaeton 20
eos 8
cc 8
up 4
amarok 1
dtype: int64
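The five tier subsets and their `(brand, model)` counts above repeat the same filter with a different brand list each time. As a minimal sketch, the pattern could be wrapped in one helper. The function name `zero_power_counts`, the `tiers` mapping, and the toy frame are hypothetical stand-ins, not part of the notebook.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; the values are illustrative only.
toy = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "bmw", "bmw", "saab", "volkswagen"],
    "model": ["golf", "polo", "3er", None, "900", "golf"],
    "power": [0, 0, 0, 0, 0, 75],
})

def zero_power_counts(frame, brands):
    """Count 0 hp rows per (brand, model) for the given brands, known models only."""
    mask = (
        frame["model"].notna()
        & (frame["power"] == 0)
        & frame["brand"].isin(brands)
    )
    return frame.loc[mask, ["brand", "model"]].value_counts().sort_index()

# Hypothetical tier mapping; the notebook builds five such brand lists by hand.
tiers = {"top": ["bmw"], "vw": ["volkswagen"], "bottom": ["saab"]}
counts_by_tier = {name: zero_power_counts(toy, brands) for name, brands in tiers.items()}
```

Each entry of `counts_by_tier` is a Series indexed by `(brand, model)`, so it can be passed to `display` or `.plot(kind='bar')` in the same way as the individual variables above.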
model_col_notna_top_vw['model'].unique()
vw0 = ['golf', 'polo', 'passat', 'transporter', 'touran', 'lupo', 'sharan',
'caddy', 'beetle', 'fox', 'bora', 'touareg', 'jetta',
'scirocco', 'tiguan', 'phaeton', 'eos', 'cc', 'up', 'amarok']
model_col_notna_top['model'].unique()
bmw0 = ['3er', '5er', 'x_reihe', '1er', '7er', 'z_reihe', 'm_reihe', '6er', 'i3']
merc0 = ['c_klasse', 'e_klasse', 'a_klasse', 'clk', 'sprinter', 'vito', 's_klasse', 'slk', 'm_klasse',
'b_klasse', 'sl', 'viano', 'v_klasse', 'cl', 'g_klasse', 'glk', 'gl']
opel0 = ['corsa', 'astra', 'vectra', 'zafira', 'omega', 'kadett', 'tigra', 'agila', 'meriva', 'combo',
'signum', 'vivaro', 'calibra', 'insignia', 'antara']
model_col_notna_top_middle['model'].unique()
audi0 = ['a4', 'a3', 'a6', '80', 'a8', 'a2','100', 'tt', 'a5', 'q7', '90', 'a1', '200', 'q5', 'q3']
fiat0 = ['punto', 'seicento', 'ducato', 'panda', 'stilo', '500', 'bravo', 'doblo', 'croma']
ford0 = ['fiesta', 'focus', 'ka', 'mondeo', 'galaxy', 'escort', 'transit', 'c_max', 'mustang',
'fusion', 'kuga', 's_max', 'b_max']
mazda0 = ['3_reihe', '6_reihe', 'mx_reihe', '5_reihe', '1_reihe', 'rx_reihe', 'cx_reihe']
peu0 = ['2_reihe', '3_reihe', '1_reihe', '4_reihe', '5_reihe']
ren0 = ['twingo', 'clio', 'megane', 'scenic', 'kangoo', 'laguna', 'espace', 'modus', 'r19']
smart0 = ['fortwo', 'forfour', 'roadster']
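Typing these model lists by hand makes it easy to drop a comma or a bracket. As a hedged alternative, the lists could be derived from the data itself. The sketch below assumes the same column names as the notebook's `df` and uses a toy frame; `models_by_brand` is a hypothetical name.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; the values are illustrative only.
toy = pd.DataFrame({
    "brand": ["bmw", "bmw", "bmw", "opel", "opel", "bmw"],
    "model": ["3er", "m_reihe", "6er", "corsa", "other", "3er"],
    "power": [0, 0, 0, 0, 0, 0],
})

# Keep 0 hp rows with a specific model name (the lists above leave out 'other').
zero_power = toy[(toy["power"] == 0) & toy["model"].notna() & (toy["model"] != "other")]

# One list of models per brand, most frequent model first.
models_by_brand = {
    brand: group["model"].value_counts().index.tolist()
    for brand, group in zero_power.groupby("brand")
}
```

With the real data, `models_by_brand['bmw']` would play the role of `bmw0`, and so on for the other brands.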
vw_model = df[(df['brand'] == 'volkswagen') & (df['power'] == 0) & (df['model'].isin(vw0))]
pivot_table_min_vw = pd.pivot_table(vw_model, index = 'model', columns = 'vehicletype', values = 'registrationyear', aggfunc = ('min'))
pivot_table_min_vw.plot(kind = 'bar', figsize = (12,8))
plt.title("Volkswagen Models with 0hp Engines")
plt.ylim(1892, 2026)
plt.grid(True)
plt.tight_layout()
plt.show()
# Source timestamps are day-first (e.g. 24/03/2016 00:00), so parse them as such
df['datecreated'] = pd.to_datetime(df['datecreated'], dayfirst=True)
# Cars shouldn't be registered after datecreated
plt.scatter(df['registrationyear'], df['datecreated'].dt.year, alpha=0.3)
plt.xlabel("Registration Year")
plt.ylabel("Ad Creation Year")
plt.title("Registration Year vs Ad Creation Year")
plt.show()
# The latest registrationyear should be 2016
mask = (df['registrationyear'] > 2016) & (df['registration_correction'] != "Y: too late")
df.loc[mask,['registration_correction']] = "Y: too late"
df[(df['registration_correction'] == "Y: too late") & (df['brand'] == 'volkswagen')].value_counts(subset = 'model')
model
golf           1612
polo            613
passat          303
lupo            227
touran          209
transporter     155
caddy            94
sharan           78
beetle           51
bora             30
other            29
fox              27
jetta            16
scirocco         15
touareg          13
tiguan           12
eos              10
up                7
phaeton           6
cc                5
amarok            1
dtype: int64
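The "too late" flag compares the registration year with a fixed cutoff of 2016. A minimal sketch of the same check against each ad's own creation year is shown below. It assumes the raw `datecreated` strings follow the day-first layout seen in the data preview, and the toy frame is illustrative only.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; the values are illustrative only.
toy = pd.DataFrame({
    "datecreated": ["07/03/2016 00:00", "24/03/2016 00:00", "02/04/2016 00:00"],
    "registrationyear": [2017, 2011, 2016],
})

# An explicit day-first format keeps 07/03/2016 as 7 March rather than 3 July.
toy["datecreated"] = pd.to_datetime(toy["datecreated"], format="%d/%m/%Y %H:%M")

# A car cannot be registered after the year in which its ad was created.
too_late = toy["registrationyear"] > toy["datecreated"].dt.year
toy["registration_correction"] = "N"
toy.loc[too_late, "registration_correction"] = "Y: too late"
```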
df[df['model'] == 'bora'].value_counts(subset = 'postalcode').head(60)
postalcode
56727    5
12051    5
22589    4
23845    4
6749     4
9111     4
30179    4
47167    4
13359    4
53773    4
47475    3
26441    3
74523    3
12157    3
32683    3
59269    3
38259    3
38531    3
45663    3
21075    3
15517    3
84307    3
44805    3
32791    3
44145    3
27283    3
94447    3
31275    3
1219     3
49835    3
4639     3
33719    3
31167    2
30851    2
34117    2
76437    2
37359    2
31137    2
21337    2
27751    2
75181    2
66333    2
65779    2
33609    2
1169     2
65451    2
65199    2
40219    2
63743    2
76149    2
67059    2
34431    2
24247    2
27419    2
27793    2
28213    2
28325    2
26835    2
35415    2
56841    2
dtype: int64
del model_col
del model_col_p0
del brand_p0
del brand_p0_rows_bottom
del brand_p0_rows_mbottom
del brand_p0_rows_middle
del brand_p0_rows_top
del brand_p0_rows_vw
del model_col_notna_top_bottom
del model_col_notna_top_mbottom
del model_col_notna_top_middle
del model_col_notna_top
del model_col_notna_top_vw
del brand_p0_notna_top_bottom
del brand_p0_notna_top_mbottom
del brand_p0_notna_top_middle
del brand_p0_notna_top
del brand_p0_notna_top_vw
gc.collect()
32347
# These postal codes are German; additionally the Bora was replaced (in Germany) by Jetta after 2005
df[(df['model'] == 'bora') & (df['registrationyear'] > 2005)]
bora_to_jetta = (df['model'] == 'bora') & (df['registrationyear'] > 2005)
df.loc[bora_to_jetta,['model']] = 'jetta'
df.loc[bora_to_jetta,['registrationyear']] = 2016
df.loc[bora_to_jetta,['registration_correction']] = 'N'
df.loc[[18669]]
| datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 18669 | 03/04/2016 10:45 | 2499 | NaN | 2016.0 | manual | 101 | jetta | 150000 | 6 | NaN | volkswagen | no | 2016-03-04 | 0 | 99097 | 07/04/2016 11:44 | N |
# The registration year cannot be later than the datecreated year
df[(df['model'] == 'jetta') & (df['registrationyear'] > 2015)]
# All dates are close to 2016, can assume simple error
jetta16 = (df['model'] == 'jetta') & (df['registrationyear'] > 2016)
df.loc[jetta16, ['registrationyear']] = 2016
df.loc[jetta16, ['registration_correction']] = "N"
df[(df['model'] == 'jetta') & (df['registrationyear'] == 2016)]
| datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12573 | 19/03/2016 16:45 | 5000 | NaN | 2016.0 | manual | 115 | jetta | 150000 | 0 | NaN | volkswagen | no | 2016-03-19 | 0 | 99310 | 20/03/2016 18:46 | N |
| 13726 | 14/03/2016 18:47 | 0 | NaN | 2016.0 | auto | 0 | jetta | 150000 | 5 | gasoline | volkswagen | no | 2016-03-14 | 0 | 25554 | 06/04/2016 03:16 | N |
| 18669 | 03/04/2016 10:45 | 2499 | NaN | 2016.0 | manual | 101 | jetta | 150000 | 6 | NaN | volkswagen | no | 2016-03-04 | 0 | 99097 | 07/04/2016 11:44 | N |
| 19856 | 07/03/2016 13:55 | 2150 | NaN | 2016.0 | manual | 75 | jetta | 150000 | 2 | lpg | volkswagen | NaN | 2016-07-03 | 0 | 64354 | 23/03/2016 05:20 | N |
| 42319 | 02/04/2016 13:55 | 3599 | NaN | 2016.0 | manual | 101 | jetta | 150000 | 12 | NaN | volkswagen | no | 2016-02-04 | 0 | 33334 | 06/04/2016 12:16 | N |
| 69206 | 04/04/2016 21:42 | 1790 | NaN | 2016.0 | manual | 101 | jetta | 150000 | 1 | petrol | volkswagen | no | 2016-04-04 | 0 | 10117 | 07/04/2016 00:15 | N |
| 69702 | 31/03/2016 22:54 | 1499 | NaN | 2016.0 | NaN | 0 | jetta | 150000 | 8 | NaN | volkswagen | NaN | 2016-03-31 | 0 | 39112 | 01/04/2016 01:42 | N |
| 73765 | 11/03/2016 14:57 | 1700 | NaN | 2016.0 | manual | 75 | jetta | 150000 | 3 | NaN | volkswagen | no | 2016-11-03 | 0 | 55270 | 07/04/2016 13:15 | N |
| 92577 | 24/03/2016 21:57 | 4790 | NaN | 2016.0 | manual | 204 | jetta | 150000 | 1 | NaN | volkswagen | no | 2016-03-24 | 0 | 8056 | 05/04/2016 15:45 | N |
| 109125 | 23/03/2016 14:55 | 2499 | NaN | 2016.0 | manual | 116 | jetta | 150000 | 4 | petrol | volkswagen | NaN | 2016-03-23 | 0 | 38154 | 01/04/2016 20:17 | N |
| 111319 | 17/03/2016 00:32 | 1875 | NaN | 2016.0 | auto | 90 | jetta | 150000 | 0 | petrol | volkswagen | NaN | 2016-03-17 | 0 | 56340 | 05/04/2016 22:44 | N |
| 115258 | 29/03/2016 15:46 | 3300 | NaN | 2016.0 | NaN | 150 | jetta | 150000 | 11 | petrol | volkswagen | no | 2016-03-29 | 0 | 96199 | 06/04/2016 01:17 | N |
| 115695 | 16/03/2016 16:48 | 3000 | NaN | 2016.0 | auto | 105 | jetta | 150000 | 8 | NaN | volkswagen | yes | 2016-03-16 | 0 | 13051 | 16/03/2016 16:48 | N |
| 116347 | 04/04/2016 12:50 | 2800 | NaN | 2016.0 | manual | 70 | jetta | 70000 | 2 | NaN | volkswagen | no | 2016-04-04 | 0 | 85459 | 06/04/2016 14:16 | N |
| 120275 | 29/03/2016 19:56 | 6800 | NaN | 2016.0 | manual | 102 | jetta | 60000 | 3 | petrol | volkswagen | no | 2016-03-29 | 0 | 17213 | 04/04/2016 05:17 | N |
| 132768 | 01/04/2016 09:51 | 1799 | NaN | 2016.0 | manual | 0 | jetta | 150000 | 5 | petrol | volkswagen | NaN | 2016-01-04 | 0 | 56727 | 01/04/2016 10:44 | N |
| 138919 | 07/03/2016 16:59 | 4300 | NaN | 2016.0 | manual | 75 | jetta | 150000 | 8 | NaN | volkswagen | no | 2016-07-03 | 0 | 12439 | 17/03/2016 06:45 | N |
| 142694 | 07/03/2016 16:48 | 0 | NaN | 2016.0 | auto | 90 | jetta | 20000 | 2 | NaN | volkswagen | NaN | 2016-07-03 | 0 | 13587 | 09/03/2016 12:45 | N |
| 145467 | 02/04/2016 20:53 | 3050 | NaN | 2016.0 | manual | 0 | jetta | 150000 | 11 | petrol | volkswagen | no | 2016-02-04 | 0 | 35260 | 02/04/2016 21:41 | N |
| 153881 | 15/03/2016 21:45 | 2888 | NaN | 2016.0 | manual | 110 | jetta | 150000 | 5 | gasoline | volkswagen | no | 2016-03-15 | 0 | 15806 | 19/03/2016 18:44 | N |
| 162111 | 16/03/2016 11:49 | 3999 | NaN | 2016.0 | manual | 160 | jetta | 125000 | 0 | petrol | volkswagen | no | 2016-03-16 | 0 | 38458 | 22/03/2016 15:45 | N |
| 170553 | 26/03/2016 10:55 | 1200 | NaN | 2016.0 | manual | 150 | jetta | 150000 | 12 | NaN | volkswagen | no | 2016-03-26 | 0 | 57250 | 05/04/2016 22:45 | N |
| 184846 | 31/03/2016 10:50 | 1850 | NaN | 2016.0 | manual | 0 | jetta | 150000 | 5 | petrol | volkswagen | NaN | 2016-03-31 | 0 | 56727 | 31/03/2016 10:50 | N |
| 185565 | 01/04/2016 18:53 | 2200 | NaN | 2016.0 | NaN | 0 | jetta | 150000 | 0 | petrol | volkswagen | NaN | 2016-01-04 | 0 | 26441 | 01/04/2016 18:53 | N |
| 189535 | 15/03/2016 22:37 | 1500 | NaN | 2016.0 | manual | 115 | jetta | 150000 | 0 | petrol | volkswagen | NaN | 2016-03-15 | 0 | 9387 | 16/03/2016 00:41 | N |
| 199567 | 20/03/2016 16:50 | 0 | NaN | 2016.0 | manual | 90 | jetta | 150000 | 0 | petrol | volkswagen | NaN | 2016-03-20 | 0 | 99867 | 25/03/2016 17:22 | N |
| 208403 | 02/04/2016 08:55 | 1600 | NaN | 2016.0 | manual | 0 | jetta | 150000 | 5 | petrol | volkswagen | NaN | 2016-02-04 | 0 | 56727 | 02/04/2016 09:46 | N |
| 219654 | 11/03/2016 10:37 | 5500 | NaN | 2016.0 | manual | 150 | jetta | 150000 | 7 | NaN | volkswagen | NaN | 2016-11-03 | 0 | 2763 | 07/04/2016 01:45 | N |
| 223717 | 27/03/2016 18:56 | 0 | NaN | 2016.0 | manual | 90 | jetta | 150000 | 0 | petrol | volkswagen | no | 2016-03-27 | 0 | 17109 | 31/03/2016 00:44 | N |
| 227749 | 30/03/2016 14:50 | 1590 | NaN | 2016.0 | manual | 150 | jetta | 150000 | 8 | petrol | volkswagen | NaN | 2016-03-30 | 0 | 87629 | 03/04/2016 04:44 | N |
| 228488 | 19/03/2016 13:38 | 2300 | NaN | 2016.0 | manual | 130 | jetta | 150000 | 2 | NaN | volkswagen | no | 2016-03-19 | 0 | 2625 | 02/04/2016 19:15 | N |
| 231870 | 29/03/2016 03:02 | 2800 | NaN | 2016.0 | auto | 69 | jetta | 150000 | 4 | petrol | volkswagen | no | 2016-03-29 | 0 | 38486 | 05/04/2016 17:44 | N |
| 234824 | 17/03/2016 11:54 | 2850 | NaN | 2016.0 | auto | 105 | jetta | 150000 | 8 | NaN | volkswagen | yes | 2016-03-17 | 0 | 13051 | 17/03/2016 11:54 | N |
| 237026 | 16/03/2016 21:46 | 3000 | NaN | 2016.0 | auto | 105 | jetta | 150000 | 8 | NaN | volkswagen | yes | 2016-03-16 | 0 | 13051 | 16/03/2016 21:46 | N |
| 239203 | 19/03/2016 22:45 | 2350 | NaN | 2016.0 | manual | 101 | jetta | 150000 | 10 | NaN | volkswagen | no | 2016-03-19 | 0 | 33100 | 07/04/2016 12:17 | N |
| 241686 | 30/03/2016 14:36 | 3500 | NaN | 2016.0 | manual | 115 | jetta | 150000 | 5 | NaN | volkswagen | no | 2016-03-30 | 0 | 49356 | 07/04/2016 06:15 | N |
| 248831 | 09/03/2016 17:52 | 1700 | NaN | 2016.0 | manual | 150 | jetta | 150000 | 5 | NaN | volkswagen | no | 2016-09-03 | 0 | 42109 | 12/03/2016 18:15 | N |
| 259038 | 08/03/2016 10:50 | 1950 | NaN | 2016.0 | manual | 101 | jetta | 150000 | 1 | lpg | volkswagen | no | 2016-08-03 | 0 | 47167 | 16/03/2016 20:48 | N |
| 261773 | 02/04/2016 15:55 | 1899 | NaN | 2016.0 | auto | 0 | jetta | 150000 | 5 | petrol | volkswagen | no | 2016-02-04 | 0 | 6193 | 02/04/2016 15:55 | N |
| 265058 | 15/03/2016 13:54 | 0 | NaN | 2016.0 | manual | 0 | jetta | 100000 | 2 | petrol | volkswagen | NaN | 2016-03-15 | 0 | 94491 | 31/03/2016 11:18 | N |
| 265331 | 03/04/2016 19:58 | 599 | NaN | 2016.0 | manual | 75 | jetta | 150000 | 0 | petrol | volkswagen | no | 2016-03-04 | 0 | 4668 | 05/04/2016 20:45 | N |
| 278004 | 17/03/2016 15:48 | 3100 | NaN | 2016.0 | auto | 90 | jetta | 125000 | 10 | NaN | volkswagen | NaN | 2016-03-17 | 0 | 52393 | 19/03/2016 15:44 | N |
| 290135 | 13/03/2016 19:38 | 3000 | NaN | 2016.0 | manual | 204 | jetta | 150000 | 0 | petrol | volkswagen | NaN | 2016-03-13 | 0 | 36369 | 17/03/2016 13:47 | N |
| 297474 | 14/03/2016 09:49 | 4100 | NaN | 2016.0 | auto | 105 | jetta | 150000 | 5 | NaN | volkswagen | no | 2016-03-14 | 0 | 45721 | 14/03/2016 09:49 | N |
| 304913 | 24/03/2016 22:46 | 4790 | NaN | 2016.0 | manual | 204 | jetta | 150000 | 1 | NaN | volkswagen | no | 2016-03-24 | 0 | 8056 | 05/04/2016 17:44 | N |
| 313465 | 21/03/2016 12:50 | 1200 | NaN | 2016.0 | manual | 115 | jetta | 150000 | 0 | NaN | volkswagen | NaN | 2016-03-21 | 0 | 84508 | 06/04/2016 07:45 | N |
| 315015 | 02/04/2016 22:57 | 3990 | NaN | 2016.0 | auto | 90 | jetta | 100000 | 8 | NaN | volkswagen | no | 2016-02-04 | 0 | 77656 | 07/04/2016 03:45 | N |
| 317772 | 10/03/2016 18:38 | 2500 | NaN | 2016.0 | manual | 75 | jetta | 150000 | 5 | petrol | volkswagen | NaN | 2016-10-03 | 0 | 6667 | 05/04/2016 21:18 | N |
| 317964 | 26/03/2016 08:54 | 0 | NaN | 2016.0 | manual | 90 | jetta | 150000 | 0 | petrol | volkswagen | no | 2016-03-26 | 0 | 17109 | 31/03/2016 03:46 | N |
| 324089 | 31/03/2016 18:56 | 3299 | NaN | 2016.0 | manual | 90 | jetta | 150000 | 2 | NaN | volkswagen | NaN | 2016-03-31 | 0 | 21481 | 06/04/2016 13:15 | N |
| 327076 | 26/03/2016 11:54 | 1600 | NaN | 2016.0 | NaN | 90 | jetta | 125000 | 3 | NaN | volkswagen | NaN | 2016-03-26 | 0 | 52393 | 06/04/2016 00:15 | N |
| 337067 | 26/03/2016 14:52 | 1450 | NaN | 2016.0 | manual | 0 | jetta | 150000 | 7 | petrol | volkswagen | NaN | 2016-03-26 | 0 | 47137 | 31/03/2016 10:17 | N |
| 341498 | 16/03/2016 16:44 | 750 | NaN | 2016.0 | manual | 75 | jetta | 150000 | 0 | petrol | volkswagen | NaN | 2016-03-16 | 0 | 35274 | 28/03/2016 15:46 | N |
| 350647 | 27/03/2016 20:49 | 4999 | NaN | 2016.0 | manual | 115 | jetta | 150000 | 11 | gasoline | volkswagen | NaN | 2016-03-27 | 0 | 91486 | 27/03/2016 20:49 | N |
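The Jetta correction above caps the registration year at 2016 and resets the flag. A hedged sketch of the same rule, capping at each row's ad year instead of a hard-coded value, could look as follows. The `ad_year` column is a hypothetical helper (for example `df['datecreated'].dt.year`), and the toy frame is illustrative only.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; `ad_year` is a hypothetical helper column.
toy = pd.DataFrame({
    "model": ["jetta", "jetta", "golf"],
    "registrationyear": [2018, 2009, 2019],
    "ad_year": [2016, 2016, 2016],
    "registration_correction": ["Y: too late", "N", "Y: too late"],
})

# Rows of the chosen model whose registration year lies after the ad year.
fixable = (toy["model"] == "jetta") & (toy["registrationyear"] > toy["ad_year"])

# Cap the year at the ad year and clear the flag for the corrected rows only.
toy.loc[fixable, "registrationyear"] = toy.loc[fixable, "ad_year"]
toy.loc[fixable, "registration_correction"] = "N"
```

Rows of other models keep their flag, so they remain visible for a separate decision.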
# Find the brands and models where the minimum price is not 0
brands_with_price = df.groupby(['brand','model'])['price'].min()
brands_with_price[brands_with_price != 0]
brand model
audi q5 65
bmw i3 250
chevrolet aveo 350
chrysler crossfire 3333
grand 100
dacia lodgy 4900
daewoo kalos 250
daihatsu charade 150
materia 2800
terios 750
fiat croma 350
ford b_max 5199
kia picanto 500
lada kalina 500
lancia elefantino 80
kappa 50
other 1
land_rover other 550
range_rover_evoque 12500
range_rover_sport 1750
serie_2 6300
mercedes_benz glk 30
nissan juke 1
rover defender 550
discovery 2800
rangerover 1050
seat exeo 5900
mii 2500
skoda citigo 3690
yeti 1750
suzuki jimny 1200
toyota auris 1
volvo v60 1000
Name: price, dtype: int64
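The minimum price alone does not show how common zero-priced listings are within a model. As a minimal sketch, the same `groupby` can also report the share of zero prices. The column names `min_price` and `zero_share` are hypothetical, and the toy frame is illustrative only.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; the values are illustrative only.
toy = pd.DataFrame({
    "brand": ["audi", "audi", "audi", "bmw", "bmw"],
    "model": ["q5", "q5", "a4", "i3", "i3"],
    "price": [65, 9000, 0, 250, 30000],
})

def zero_share(prices):
    """Fraction of listings in the group with a price of exactly 0."""
    return (prices == 0).mean()

# Minimum price and share of zero-priced listings per (brand, model).
summary = toy.groupby(["brand", "model"])["price"].agg(
    min_price="min",
    zero_share=zero_share,
)

# Same selection as above: groups whose cheapest listing is not 0.
never_free = summary[summary["min_price"] != 0]
```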
pivot = pd.pivot_table(df, index = 'model', columns = 'brand', values = 'price')
pivot.boxplot(vert = False, figsize = (12,8))
chevy = df[df['brand'] == 'chevrolet']
chevy_pivot = pd.pivot_table(chevy, index = 'registrationyear', columns = 'model', values = 'price')
chevy_pivot
chevy_pivot.boxplot(vert = False)
captiva = (df['vehicletype'] == 'suv') & (df['registrationyear'] > 2005) & (df['brand'] == 'chevrolet')
df.loc[captiva,['model']] = 'captiva'
df[(df['vehicletype'] == 'suv') & (df['registrationyear'] > 2005) & (df['brand'] == 'chevrolet')]
| datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2670 | 07/03/2016 23:56 | 9199 | suv | 2006.0 | manual | 150 | captiva | 125000 | 10 | gasoline | chevrolet | no | 2016-07-03 | 0 | 59821 | 17/03/2016 21:45 | N |
| 7816 | 02/04/2016 14:45 | 8600 | suv | 2008.0 | auto | 150 | captiva | 5000 | 9 | gasoline | chevrolet | no | 2016-02-04 | 0 | 33602 | 06/04/2016 13:15 | N |
| 9501 | 03/04/2016 13:46 | 14950 | suv | 2011.0 | auto | 184 | captiva | 90000 | 9 | gasoline | chevrolet | no | 2016-03-04 | 0 | 47918 | 05/04/2016 12:45 | N |
| 10054 | 09/03/2016 10:53 | 9500 | suv | 2007.0 | auto | 150 | captiva | 150000 | 9 | gasoline | chevrolet | no | 2016-09-03 | 0 | 39343 | 05/04/2016 14:46 | N |
| 10933 | 22/03/2016 23:59 | 8300 | suv | 2008.0 | manual | 136 | captiva | 30000 | 5 | petrol | chevrolet | no | 2016-03-22 | 0 | 71065 | 30/03/2016 14:18 | N |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 337492 | 23/03/2016 13:55 | 18500 | suv | 2013.0 | manual | 167 | captiva | 20000 | 1 | petrol | chevrolet | no | 2016-03-23 | 0 | 99734 | 05/04/2016 15:18 | N |
| 341978 | 03/04/2016 16:50 | 14999 | suv | 2012.0 | manual | 163 | captiva | 70000 | 4 | gasoline | chevrolet | no | 2016-03-04 | 0 | 26209 | 05/04/2016 16:46 | N |
| 342687 | 31/03/2016 21:58 | 15900 | suv | 2012.0 | auto | 184 | captiva | 80000 | 8 | gasoline | chevrolet | no | 2016-03-31 | 0 | 15831 | 06/04/2016 18:17 | N |
| 344562 | 10/03/2016 15:49 | 11990 | suv | 2011.0 | manual | 167 | captiva | 40000 | 11 | petrol | chevrolet | no | 2016-10-03 | 0 | 91452 | 21/03/2016 02:45 | N |
| 354111 | 16/03/2016 16:55 | 15700 | suv | 2012.0 | auto | 184 | captiva | 100000 | 3 | gasoline | chevrolet | no | 2016-03-16 | 0 | 46242 | 06/04/2016 21:47 | N |
186 rows × 17 columns
convertible = (df['brand'] == 'chevrolet') & (df['vehicletype'] == 'convertible')
df.loc[convertible,['model']] = 'other'
matiz68 = (df['brand'] == 'chevrolet') & (df['power'] == 68) & (df['price'] < 2600)
df.loc[matiz68,['model']] = 'matiz'
df.loc[matiz68,['vehicletype']] = 'small'
df[(df['brand'] == 'chevrolet') & (df['power'] == 68) & (df['price'] < 2600)]
| datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 82008 | 08/03/2016 22:44 | 2599 | small | 2008.0 | manual | 68 | matiz | 100000 | 8 | NaN | chevrolet | NaN | 2016-08-03 | 0 | 44145 | 14/03/2016 06:16 | N |
| 140254 | 22/03/2016 21:36 | 1200 | small | 2005.0 | manual | 68 | matiz | 90000 | 5 | petrol | chevrolet | NaN | 2016-03-22 | 0 | 4155 | 24/03/2016 07:15 | N |
| 205903 | 14/03/2016 19:41 | 1799 | small | 2008.0 | manual | 68 | matiz | 100000 | 5 | petrol | chevrolet | no | 2016-03-14 | 0 | 24816 | 06/04/2016 04:17 | N |
| 257625 | 23/03/2016 10:38 | 1500 | small | 2005.0 | manual | 68 | matiz | 150000 | 11 | lpg | chevrolet | NaN | 2016-03-23 | 0 | 41238 | 24/03/2016 17:17 | N |
| 353189 | 19/03/2016 13:37 | 1200 | small | 2016.0 | manual | 68 | matiz | 90000 | 5 | petrol | chevrolet | NaN | 2016-03-19 | 0 | 4155 | 21/03/2016 17:50 | N |
matiz52 = (df['brand'] == 'chevrolet') & (df['power'] == 52)
df.loc[matiz52,['model']] = 'matiz'
df.loc[matiz52,['vehicletype']] = 'small'
df[(df['brand'] == 'chevrolet') & (df['power'] == 52)]
| datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 373 | 02/04/2016 12:39 | 1350 | small | 2005.0 | manual | 52 | matiz | 150000 | 6 | petrol | chevrolet | yes | 2016-02-04 | 0 | 91207 | 06/04/2016 10:17 | N |
| 2263 | 27/03/2016 19:55 | 2399 | small | 2016.0 | manual | 52 | matiz | 80000 | 7 | petrol | chevrolet | NaN | 2016-03-27 | 0 | 33605 | 05/04/2016 18:45 | N |
| 2820 | 26/03/2016 20:47 | 3350 | small | 2010.0 | manual | 52 | matiz | 80000 | 2 | petrol | chevrolet | no | 2016-03-26 | 0 | 18273 | 06/04/2016 11:17 | N |
| 5636 | 30/03/2016 08:55 | 3650 | small | 2009.0 | manual | 52 | matiz | 50000 | 7 | petrol | chevrolet | no | 2016-03-30 | 0 | 26789 | 30/03/2016 08:55 | N |
| 7123 | 04/04/2016 18:39 | 2500 | small | 2008.0 | manual | 52 | matiz | 125000 | 12 | petrol | chevrolet | no | 2016-04-04 | 0 | 21493 | 06/04/2016 20:44 | N |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 340075 | 17/03/2016 21:37 | 4999 | small | 2010.0 | auto | 52 | matiz | 30000 | 3 | petrol | chevrolet | no | 2016-03-17 | 0 | 45329 | 17/03/2016 22:40 | N |
| 340549 | 29/03/2016 15:57 | 1599 | small | 2009.0 | manual | 52 | matiz | 80000 | 5 | petrol | chevrolet | no | 2016-03-29 | 0 | 20357 | 06/04/2016 02:15 | N |
| 344585 | 13/03/2016 17:50 | 2100 | small | 2009.0 | manual | 52 | matiz | 125000 | 11 | petrol | chevrolet | no | 2016-03-13 | 0 | 22869 | 28/03/2016 14:16 | N |
| 349474 | 08/03/2016 13:25 | 2600 | small | 2009.0 | manual | 52 | matiz | 50000 | 3 | petrol | chevrolet | no | 2016-08-03 | 0 | 65719 | 11/03/2016 09:45 | N |
| 349800 | 01/04/2016 22:38 | 1950 | small | 2008.0 | manual | 52 | matiz | 60000 | 9 | petrol | chevrolet | no | 2016-01-04 | 0 | 42369 | 01/04/2016 23:41 | N |
101 rows × 17 columns
matiz67 = (df['brand'] == 'chevrolet') & (df['power'] == 67)
df.loc[matiz67,['model']] = 'matiz'
df.loc[matiz67,['vehicletype']] = 'small'
df[(df['brand'] == 'chevrolet') & (df['power'] == 67)]
| datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1981 | 27/03/2016 18:43 | 2990 | small | 2007.0 | manual | 67 | matiz | 125000 | 4 | lpg | chevrolet | no | 2016-03-27 | 0 | 72108 | 05/04/2016 15:15 | N |
| 3769 | 01/04/2016 15:53 | 1500 | small | 2016.0 | manual | 67 | matiz | 125000 | 10 | NaN | chevrolet | NaN | 2016-01-04 | 0 | 4158 | 07/04/2016 13:50 | N |
| 5215 | 26/03/2016 08:55 | 2900 | small | 2010.0 | manual | 67 | matiz | 80000 | 4 | petrol | chevrolet | no | 2016-03-26 | 0 | 25421 | 03/04/2016 19:47 | N |
| 7757 | 21/03/2016 09:52 | 3750 | small | 2007.0 | manual | 67 | matiz | 70000 | 10 | lpg | chevrolet | no | 2016-03-21 | 0 | 53945 | 06/04/2016 02:45 | N |
| 9006 | 14/03/2016 11:38 | 2750 | small | 2007.0 | manual | 67 | matiz | 70000 | 10 | petrol | chevrolet | no | 2016-03-14 | 0 | 21029 | 07/04/2016 12:45 | N |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 340326 | 02/04/2016 22:51 | 2150 | small | 2007.0 | manual | 67 | matiz | 150000 | 12 | petrol | chevrolet | no | 2016-02-04 | 0 | 31863 | 07/04/2016 00:45 | N |
| 344984 | 26/03/2016 22:54 | 2100 | small | 2007.0 | manual | 67 | matiz | 125000 | 6 | petrol | chevrolet | no | 2016-03-26 | 0 | 48565 | 04/04/2016 22:47 | N |
| 348552 | 04/04/2016 13:46 | 2250 | small | 2006.0 | manual | 67 | matiz | 150000 | 7 | lpg | chevrolet | no | 2016-04-04 | 0 | 33397 | 06/04/2016 14:46 | N |
| 351693 | 28/03/2016 17:41 | 1100 | small | 2006.0 | manual | 67 | matiz | 150000 | 6 | petrol | chevrolet | no | 2016-03-28 | 0 | 46537 | 06/04/2016 23:15 | N |
| 352283 | 12/03/2016 15:46 | 1950 | small | 2007.0 | manual | 67 | matiz | 90000 | 8 | petrol | chevrolet | no | 2016-12-03 | 0 | 48529 | 15/03/2016 21:16 | N |
91 rows × 17 columns
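The 52 hp and 67 hp rules above apply the same two assignments with a different power value. A minimal sketch that folds them into one mask with `isin` is shown below. The list name `matiz_powers` is hypothetical, and the 68 hp rule is left out because it carries an extra price condition.

```python
import pandas as pd

# Toy stand-in for the notebook's `df`; the values are illustrative only.
toy = pd.DataFrame({
    "brand": ["chevrolet", "chevrolet", "chevrolet", "opel"],
    "power": [52, 67, 150, 52],
    "model": [None, "other", "captiva", "corsa"],
    "vehicletype": [None, "sedan", "suv", "small"],
})

# Power values treated above as identifying a Chevrolet Matiz without further conditions.
matiz_powers = [52, 67]
is_matiz = (toy["brand"] == "chevrolet") & toy["power"].isin(matiz_powers)

# Same two assignments as in the notebook, applied once for all listed powers.
toy.loc[is_matiz, "model"] = "matiz"
toy.loc[is_matiz, "vehicletype"] = "small"
```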
peugeot = df[df['brand'] == 'peugeot']
peugeot_pivot = pd.pivot_table(peugeot,index = 'power', columns = 'model', values = 'price')
df[(df['brand'] == 'peugeot') & (df['power'].isin([7,33,42,43,48,57, 454]))]
re_1 = (df['brand'] == 'peugeot') & (df['power'].isin([7,33,42,43,48,57,454]))
df.loc[re_1,['vehicletype']] = 'small'
df.loc[re_1,['model']] = '1_reihe'
df[(df['brand'] == 'peugeot') & (df['power'].isin([7,33,42,43,48,57,454]))]
| datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 44179 | 02/04/2016 17:52 | 500 | small | 1998.0 | auto | 7 | 1_reihe | 100000 | 11 | petrol | peugeot | no | 2016-02-04 | 0 | 66271 | 02/04/2016 17:52 | N |
| 154470 | 07/03/2016 10:52 | 100 | small | 1995.0 | manual | 42 | 1_reihe | 150000 | 6 | petrol | peugeot | NaN | 2016-07-03 | 0 | 1665 | 15/03/2016 22:16 | N |
| 174795 | 10/03/2016 23:44 | 150 | small | 1997.0 | manual | 33 | 1_reihe | 150000 | 11 | petrol | peugeot | yes | 2016-10-03 | 0 | 66333 | 11/03/2016 12:17 | N |
| 186556 | 20/03/2016 16:55 | 430 | small | 2016.0 | NaN | 33 | 1_reihe | 150000 | 9 | petrol | peugeot | NaN | 2016-03-20 | 0 | 73525 | 04/04/2016 20:44 | N |
| 191097 | 23/03/2016 22:51 | 0 | small | 1997.0 | manual | 33 | 1_reihe | 125000 | 6 | NaN | peugeot | yes | 2016-03-23 | 0 | 86343 | 06/04/2016 06:45 | N |
| 204925 | 29/03/2016 15:45 | 850 | small | 1997.0 | manual | 57 | 1_reihe | 150000 | 2 | petrol | peugeot | no | 2016-03-29 | 0 | 16909 | 06/04/2016 01:16 | N |
| 210942 | 30/03/2016 15:51 | 700 | small | 1998.0 | manual | 454 | 1_reihe | 150000 | 8 | petrol | peugeot | NaN | 2016-03-30 | 0 | 85598 | 30/03/2016 15:51 | N |
| 262687 | 05/03/2016 16:52 | 0 | small | 1996.0 | manual | 48 | 1_reihe | 150000 | 7 | petrol | peugeot | yes | 2016-05-03 | 0 | 26441 | 24/03/2016 18:45 | N |
| 314981 | 20/03/2016 04:02 | 700 | small | 2017.0 | manual | 33 | 1_reihe | 150000 | 7 | petrol | peugeot | no | 2016-03-20 | 0 | 28759 | 23/03/2016 22:17 | Y: too late |
| 323988 | 10/03/2016 22:50 | 1033 | small | 1996.0 | manual | 43 | 1_reihe | 150000 | 10 | petrol | peugeot | no | 2016-10-03 | 0 | 42277 | 24/03/2016 20:18 | N |
coupe = df[(df['vehicletype'] == 'coupe') & (df['price'] > 0)]
suv = df[(df['vehicletype'] == 'suv') & (df['price'] > 0)]
small = df[(df['vehicletype'] == 'small') & (df['price'] > 0)]
sedan = df[(df['vehicletype'] == 'sedan') & (df['price'] > 0)]
convertible = df[(df['vehicletype'] == 'convertible') & (df['price'] > 0)]
bus = df[(df['vehicletype'] == 'bus') & (df['price'] > 0)]
wagon = df[(df['vehicletype'] == 'wagon') & (df['price'] > 0)]
wagon['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Wagons per Brand')
wagon.groupby('brand')['price'].mean().sort_values(ascending=False).plot(kind='bar', figsize=(10,5), title='Average Wagon Price per Brand')
plt.figure(figsize=(14,16))
sns.boxplot(data=wagon, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Wagon Models per Brand')
plt.grid()
plt.show()
Wagon Type Vehicles Against Price
| Brand | Vehicle Type | Approx. Count | Avg Price | Price Range (25th - 75th percentile) |
|---|---|---|---|---|
| volkswagen | Wagon | 12,500 | 5,000 | 1,250 - 7,000 |
| audi | Wagon | 11,000 | 7,000 | 2,500 - 11,000 |
| bmw | Wagon | 8,000 | 7,000 | 2,300 - 9,500 |
| opel | Wagon | 7,000 | 3,500 | 1,000 - 4,500 |
| mercedes_benz | Wagon | 6,500 | 6,000 | 1,500 - 8,500 |
| ford | Wagon | 5,900 | 6,000 | 1,500 - 8,000 |
| skoda | Wagon | 3,000 | 6,500 | 2,000 - 9,000 |
| volvo | Wagon | 2,200 | 5,500 | 2,000 - 7,500 |
| renault | Wagon | 2,000 | 3,000 | 1,000 - 4,000 |
| peugeot | Wagon | 1,800 | 4,900 | 1,500 - 6,500 |
| mazda | Wagon | 1,000 | 4,800 | 2,000 - 6,500 |
| toyota | Wagon | 800 | 4,700 | 2,000 - 6,500 |
| alfa_romeo | Wagon | 600 | 4,400 | 1,500 - 6,000 |
| fiat | Wagon | 500 | 2,200 | 1,000 - 3,000 |
| seat | Wagon | 500 | 4,000 | 1,500 - 5,500 |
| nissan | Wagon | 400 | 3,800 | 1,500 - 5,000 |
| citroen | Wagon | 400 | 3,700 | 1,500 - 5,000 |
| mitsubishi | Wagon | 300 | 1,800 | 800 - 2,500 |
| dacia | Wagon | 300 | 3,700 | 2,000 - 5,000 |
| chevrolet | Wagon | 200 | 3,500 | 1,500 - 5,000 |
| hyundai | Wagon | 200 | 11,500 | 6,000 - 15,000 |
| kia | Wagon | 200 | 3,300 | 1,500 - 4,500 |
| mini | Wagon | 100 | 8,000 | 4,000 - 11,000 |
| subaru | Wagon | <100 | 4,000 | 2,000 - 5,500 |
| honda | Wagon | <100 | 3,000 | 1,500 - 4,000 |
| chrysler | Wagon | <100 | 2,800 | 1,000 - 4,000 |
| saab | Wagon | <100 | 2,800 | 1,000 - 4,000 |
| suzuki | Wagon | <100 | 2,300 | 1,000 - 3,000 |
| smart | Wagon | <100 | 2,200 | 1,000 - 3,000 |
| lancia | Wagon | <100 | 2,000 | 800 - 3,000 |
| daewoo | Wagon | <100 | 900 | 500 - 1,200 |
| jaguar | Wagon | <100 | 1,800 | 1,000 - 2,500 |
| land_rover | Wagon | <100 | 2,900 | 1,500 - 4,000 |
| lada | Wagon | <100 | 1,700 | 800 - 2,500 |
| rover | Wagon | <100 | 1,600 | 800 - 2,200 |
| trabant | Wagon | <100 | 1,800 | 1,000 - 2,500 |
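The figures in the table above are approximate. As a minimal sketch, the same kind of summary can be computed directly with a groupby aggregation instead of being estimated from the plots. The `toy` frame and the helper name `summarize_by_brand` are assumptions for illustration; in the notebook the `wagon` subset would be passed in instead.

```python
import pandas as pd

# Hypothetical toy data standing in for the notebook's `wagon` subset.
toy = pd.DataFrame({
    "brand": ["audi", "audi", "audi", "opel", "opel"],
    "price": [2000, 6000, 10000, 1000, 3000],
})

def summarize_by_brand(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count, mean price, and 25th/75th price percentiles per brand."""
    grouped = frame.groupby("brand")["price"]
    summary = pd.DataFrame({
        "count": grouped.count(),
        "avg_price": grouped.mean(),
        "q25": grouped.quantile(0.25),
        "q75": grouped.quantile(0.75),
    })
    # Most common brands first, matching the ordering of the table.
    return summary.sort_values("count", ascending=False)

summary = summarize_by_brand(toy)
print(summary)
```

Calling `summarize_by_brand(wagon)` would give exact values for the count, average, and interquartile columns.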
df[(df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & \
(df['price'] < 7000) & (df['registrationyear'] > 1996) & (df['registrationyear'] < 1999) & (df['power'].isin([150]))]
passat = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & (df['price'] < 7000) & (df['registrationyear'] > 1996) & \
(df['registrationyear'] < 1999) & (df['power'].isin([150]))
df.loc[passat,['model']] = 'passat'
passat1 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & (df['price'] < 7000) & (df['registrationyear'] == 1991) & \
(df['power'].isin([90,136]))
df.loc[passat1,['model']] = 'passat'
passat2 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & (df['price'] < 7000) & (df['registrationyear'] == 1992) & \
(df['model'].isna())
df.loc[passat2,['model']] = 'passat'
passat3 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & (df['price'] < 7000) & \
(df['registrationyear'].isin([1982,1993,1994])) & (df['model'].isna())
df.loc[passat3,['model']] = 'passat'
passat4 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & (df['price'] < 7000) & (df['registrationyear'].isin([1996])) & \
(df['power'].isin([174])) & (df['model'].isna())
df.loc[passat4,['model']] = 'passat'
passat5 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['model'].isna()) & (df['power'].isin([125])) & (df['price'] > 1250) & \
(df['price'] < 7000)
df.loc[passat5,['model']] = 'passat'
passat6 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['power'].isin([110,193])) & (df['price'] > 1250) & (df['price'] < 7000) & \
(df['registrationyear'].isin([1998])) & (df['model'].isna())
df.loc[passat6, ['model']] = 'passat'
golf = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['price'] > 1250) & (df['price'] < 7000) & (df['registrationyear'].isin([1996])) & \
(df['power'].isin([75,110])) & (df['model'].isna())
df.loc[golf,['model']] = 'golf'
passat140 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'wagon') & (df['model'].isna()) & (df['registrationyear'] > 2004) & \
(df['registrationyear'] < 2007) & (df['power'].isin([140]))
df.loc[passat140,['model']] = 'passat'
golf90 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 2000) & \
(df['registrationyear'] < 2005) & (df['power'].isin([90]))
df.loc[golf90,['model']] = 'golf'
passat90 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1985) & \
(df['registrationyear'] < 1993) & (df['power'].isin([90]))
df.loc[passat90,['model']] = 'passat'
golf75 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1993) & \
(df['registrationyear'] < 1995) & (df['power'].isin([75]))
df.loc[golf75,['model']] = 'golf'
golf7502 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 2001) & \
(df['registrationyear'] < 2003) & (df['power'].isin([75]))
df.loc[golf7502,['model']] = 'golf'
passat105 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1995) & \
(df['registrationyear'] < 1998) & (df['power'].isin([105]))
df.loc[passat105,['model']] = 'passat'
passat131 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1999) & \
(df['registrationyear'] < 2002) & (df['power'].isin([131]))
df.loc[passat131,['model']] = 'passat'
passat116 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1989) \
& (df['registrationyear'] < 1997) & (df['power'].isin([116]))
df.loc[passat116,['model']] = 'passat'
passat150 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1995) \
& (df['registrationyear'] < 2006) & (df['power'].isin([150]))
df.loc[passat150,['model']] = 'passat'
passat115 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1990) \
& (df['registrationyear'] < 1997) & (df['power'].isin([115]))
df.loc[passat115,['model']] = 'passat'
passat170 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 2004) & \
(df['registrationyear'] < 2012) & (df['power'].isin([170]))
df.loc[passat170,['model']] = 'passat'
golf110 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 2013) & \
(df['registrationyear'] < 2017) & (df['power'].isin([60]))
df.loc[golf110,['model']] = 'golf'
golf60 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1990) & \
(df['registrationyear'] < 1996) & (df['power'].isin([60]))
df.loc[golf60,['model']] = 'golf'
polo60 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1996) & \
(df['registrationyear'] < 2001) & (df['power'].isin([60]))
df.loc[polo60,['model']] = 'polo'
passat125 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1996) & \
(df['registrationyear'] < 2000) & (df['power'].isin([125]))
df.loc[passat125,['model']] = 'passat'
passat100 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1990) & \
(df['registrationyear'] < 2005) & (df['power'].isin([100]))
df.loc[passat100,['model']] = 'passat'
passat174 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1993) & \
(df['registrationyear'] < 1997) & (df['power'].isin([174]))
df.loc[passat174,['model']] = 'passat'
passat130 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1998) & \
(df['registrationyear'] < 2005) & (df['power'].isin([130]))
df.loc[passat130,['model']] = 'passat'
passat120 = (df['brand'] == 'volkswagen') & (df['model'].isna()) & (df['vehicletype'] == 'wagon') & (df['registrationyear'] > 1980) & (df['registrationyear'] < 2000) & (df['power'].isin([120]))
df.loc[passat120,['model']] = 'passat'
vw_small75 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'small') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([1985,1992]))
df.loc[vw_small75,['model']] = 'golf'
vw_sedan75 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'] > 1993) & (df['registrationyear'] < 2007)
df.loc[vw_sedan75,['model']] = 'golf'
opel_sedan84 = (df['brand'] == 'opel') & (df['vehicletype'] == 'sedan') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([1984]))
df.loc[opel_sedan84,['model']] = 'kadett'
opel_sedan94 = (df['brand'] == 'opel') & (df['vehicletype'] == 'sedan') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([1994,1999,2000]))
df.loc[opel_sedan94,['model']] = 'astra'
opel_sedan04 = (df['brand'] == 'opel') & (df['vehicletype'] == 'sedan') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([2004,2008]))
df.loc[opel_sedan04,['model']] = 'corsa'
ford_sedan99 = (df['brand'] == 'ford') & (df['vehicletype'] == 'sedan') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([1999,2001,2003]))
df.loc[ford_sedan99,['model']] = 'focus'
opel_wagon96 = (df['brand'] == 'opel') & (df['vehicletype'] == 'wagon') & (df['model'].isna()) & (df['power'].isin([75])) & (df['registrationyear'] > 1995) \
& (df['registrationyear'] < 2001)
df.loc[opel_wagon96,['model']] = 'astra'
opel_small01 = (df['brand'] == 'opel') & (df['vehicletype'] == 'small') & (df['model'].isna()) & (df['power'].isin([75])) \
& (df['registrationyear'].isin([2001, 2002, 2003, 2004, 2006, 2008]))
df.loc[opel_small01,['model']] = 'corsa'
renault_small91 = (df['brand'] == 'renault') & (df['vehicletype'] == 'small') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'] > 1990) & (df['registrationyear'] < 2001)
df.loc[renault_small91,['model']] = 'clio'
peugeot_small92 = (df['brand'] == 'peugeot') & (df['vehicletype'] == 'small') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([1992]))
df.loc[peugeot_small92,['model']] = '1_reihe'
peugeot_small94 = (df['brand'] == 'peugeot') & (df['vehicletype'] == 'small') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'].isin([1994]))
df.loc[peugeot_small94,['model']] = '3_reihe'
peugeot_small00 = (df['brand'] == 'peugeot') & (df['vehicletype'] == 'small') & (df['model'].isna()) & (df['power'].isin([75])) & \
(df['registrationyear'] > 1999) & (df['registrationyear'] < 2010)
df.loc[peugeot_small00,['model']] = '2_reihe'
# Free the boolean masks that are no longer needed
del (vw_small75, vw_sedan75, opel_sedan84, opel_sedan94, opel_sedan04, ford_sedan99,
     opel_wagon96, opel_small01, renault_small91, peugeot_small92, peugeot_small94, peugeot_small00)
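The masks above all follow one pattern: conditions on brand, vehicle type, power, and registration year, applied only where `model` is missing. As a rough sketch of how that pattern could be expressed as data rather than as one variable per rule, the block below uses a hypothetical `rules` list and a hypothetical helper `apply_model_rules`. The toy frame is invented, and the three rules mirror `passat140`, `golf90`, and `opel_sedan84` from the cells above, with inclusive year bounds.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame using the notebook's lowercase column names.
toy = pd.DataFrame({
    "brand": ["volkswagen", "volkswagen", "opel", "opel"],
    "vehicletype": ["wagon", "wagon", "sedan", "sedan"],
    "power": [140, 90, 75, 75],
    "registrationyear": [2005, 2002, 1984, 1994],
    "model": [np.nan, np.nan, np.nan, "vectra"],
})

# Each rule: (brand, vehicle type, allowed powers, first year, last year, model).
rules = [
    ("volkswagen", "wagon", [140], 2005, 2006, "passat"),
    ("volkswagen", "wagon", [90], 2001, 2004, "golf"),
    ("opel", "sedan", [75], 1984, 1984, "kadett"),
]

def apply_model_rules(frame, rules):
    """Fill missing `model` values in place, rule by rule, in list order."""
    for brand, vtype, powers, year_from, year_to, model in rules:
        mask = (
            (frame["brand"] == brand)
            & (frame["vehicletype"] == vtype)
            & frame["power"].isin(powers)
            & frame["registrationyear"].between(year_from, year_to)
            & frame["model"].isna()  # never overwrite a known model
        )
        frame.loc[mask, "model"] = model
    return frame

apply_model_rules(toy, rules)
print(toy["model"].tolist())
```

Because rules are applied in list order and only touch missing values, an earlier rule takes precedence over a later one, just as in the sequential assignments above. No temporary mask variables are left behind, so no `del` is needed.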
brand_power = df[(df['power'].isin([75,60,150,101,140,90,116,105,170,125,136,102])) & (df['model'].notna()) & (df['model'] != 'model') & \
(df['brand'] != 'sonstige_autos')].value_counts(subset = 'brand')
brand_power.plot(kind = 'bar')
plt.title("Brands with Top HP counts")
plt.grid()
plt.show()
df[(df['brand'] == 'nissan')].value_counts(subset = 'model')
model
micra 1756
other 702
primera 620
almera 584
qashqai 531
x_trail 206
note 130
juke 102
navara 98
dtype: int64
brand_power1 = df[(df['power'].isin([75,60])) & (df['model'].notna()) & (df['model'] != 'model') & \
(df['brand'] != 'sonstige_autos')]
brand_power2 = df[(df['power'].isin([150,101])) & (df['model'].notna()) & (df['model'] != 'model') & \
(df['brand'] != 'sonstige_autos')]
brand_power3 = df[(df['power'].isin([140,90])) & (df['model'].notna()) & (df['model'] != 'model') & \
(df['brand'] != 'sonstige_autos')]
brand_power4 = df[(df['power'].isin([116,105])) & (df['model'].notna()) & (df['model'] != 'model') & \
(df['brand'] != 'sonstige_autos')]
brand_power5 = df[(df['power'].isin([170,125])) & (df['model'].notna()) & (df['model'] != 'model') & \
(df['brand'] != 'sonstige_autos')]
brand_power6 = df[(df['power'].isin([136,102])) & (df['model'].notna()) & (df['model'] != 'model') & \
(df['brand'] != 'sonstige_autos')]
top5_brand_power = ['volkswagen','opel','bmw','audi','ford']
over1000_brand_power = ['mercedes_benz', 'renault', 'peugeot', 'seat', 'skoda', 'fiat', 'citroen', 'honda', 'mazda', 'mini', 'nissan', 'mitsubishi', 'volvo']
under1000_brand_power = ['toyota', 'alfa_romeo', 'hyundai', 'kia', 'dacia', 'suzuki', 'chrysler', 'subaru', 'smart', 'chevrolet', 'saab', 'lancia',
'rover', 'jeep', 'daihatsu', 'daewoo', 'porsche', 'lada', 'land_rover', 'jaguar']
top5_brands = brand_power1[brand_power1['brand'].isin(top5_brand_power)]
top5_brands2 = brand_power2[brand_power2['brand'].isin(top5_brand_power)]
top5_brands3 = brand_power3[brand_power3['brand'].isin(top5_brand_power)]
top5_brands4 = brand_power4[brand_power4['brand'].isin(top5_brand_power)]
top5_brands5 = brand_power5[brand_power5['brand'].isin(top5_brand_power)]
top5_brands6 = brand_power6[brand_power6['brand'].isin(top5_brand_power)]
middle_brands = brand_power1[brand_power1['brand'].isin(over1000_brand_power)]
middle_brands2 = brand_power2[brand_power2['brand'].isin(over1000_brand_power)]
middle_brands3 = brand_power3[brand_power3['brand'].isin(over1000_brand_power)]
middle_brands4 = brand_power4[brand_power4['brand'].isin(over1000_brand_power)]
middle_brands5 = brand_power5[brand_power5['brand'].isin(over1000_brand_power)]
middle_brands6 = brand_power6[brand_power6['brand'].isin(over1000_brand_power)]
lower_brands = brand_power1[brand_power1['brand'].isin(under1000_brand_power)]
lower_brands2 = brand_power2[brand_power2['brand'].isin(under1000_brand_power)]
lower_brands3 = brand_power3[brand_power3['brand'].isin(under1000_brand_power)]
lower_brands4 = brand_power4[brand_power4['brand'].isin(under1000_brand_power)]
lower_brands5 = brand_power5[brand_power5['brand'].isin(under1000_brand_power)]
lower_brands6 = brand_power6[brand_power6['brand'].isin(under1000_brand_power)]
# Count known brand/model/power combinations to help infer missing (NaN) models
top5 = top5_brands[['brand','model','power']].value_counts().sort_index()
top52 = top5_brands2[['brand','model','power']].value_counts().sort_index()
top53 = top5_brands3[['brand','model','power']].value_counts().sort_index()
top54 = top5_brands4[['brand','model','power']].value_counts().sort_index()
top55 = top5_brands5[['brand','model','power']].value_counts().sort_index()
top56 = top5_brands6[['brand','model','power']].value_counts().sort_index()
middle = middle_brands[['brand','model','power']].value_counts().sort_index()
middle2 = middle_brands2[['brand','model','power']].value_counts().sort_index()
middle3 = middle_brands3[['brand','model','power']].value_counts().sort_index()
middle4 = middle_brands4[['brand','model','power']].value_counts().sort_index()
middle5 = middle_brands5[['brand','model','power']].value_counts().sort_index()
middle6 = middle_brands6[['brand','model','power']].value_counts().sort_index()
lower = lower_brands[['brand','model','power']].value_counts().sort_index()
lower2 = lower_brands2[['brand','model','power']].value_counts().sort_index()
lower3 = lower_brands3[['brand','model','power']].value_counts().sort_index()
lower4 = lower_brands4[['brand','model','power']].value_counts().sort_index()
lower5 = lower_brands5[['brand','model','power']].value_counts().sort_index()
lower6 = lower_brands6[['brand','model','power']].value_counts().sort_index()
print("Batch 1: HP [60 & 75]")
# Top 5 Prevalent Brands w/ specified HP [60 & 75]
top5.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Top 5: Count of [60 & 75] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
# Middle Prevalent Brands w/ specified HP [60 & 75]
middle.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Middle: Count of [60 & 75] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
# Lower Prevalent Brands w/ specified HP [60 & 75]
lower.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Lower: Count of [60 & 75] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
print("Batch 2: HP [150 & 101]")
# Batch 2: HP [150 & 101]
top52.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Top 5: Count of [150 & 101] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
middle2.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Middle: Count of [150 & 101] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
lower2.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Lower: Count of [150 & 101] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
print("Batch 3: HP [140 & 90]")
# Batch 3: HP [140 & 90]
top53.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Top 5: Count of [140 & 90] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
middle3.plot(kind = 'bar', x = ('brand','model','power'), figsize = (20,8))
plt.title('Middle: Count of [140 & 90] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
lower3.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Lower: Count of [140 & 90] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
print("Batch 4: HP [116 & 105]")
# Batch 4: HP [116 & 105]
top54.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Top 5: Count of [116 & 105] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
middle4.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Middle: Count of [116 & 105] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
lower4.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Lower: Count of [116 & 105] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
print("Batch 5: HP [170 & 125]")
# Batch 5: HP [170 & 125]
top55.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Top 5: Count of [170 & 125] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
middle5.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Middle: Count of [170 & 125] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
lower5.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Lower: Count of [170 & 125] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
print("Batch 6: HP [136 & 102]")
# Batch 6: HP [136 & 102]
top56.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Top 5: Count of [136 & 102] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
middle6.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Middle: Count of [136 & 102] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
lower6.plot(kind = 'bar', x = ('brand','model','power'), figsize = (18,8))
plt.title('Lower: Count of [136 & 102] HP for Specified Brand/Model of Vehicle')
plt.xlabel("Brand, Model, HP")
plt.ylabel("HP Count")
plt.xticks(rotation = 90)
plt.grid()
plt.show()
Batch 1: HP [60 & 75]
Batch 2: HP [150 & 101]
Batch 3: HP [140 & 90]
Batch 4: HP [116 & 105]
Batch 5: HP [170 & 125]
Batch 6: HP [136 & 102]
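The eighteen charts above differ only in the counts being plotted and in the title text. A minimal sketch of a helper that draws them in a loop is shown below. The function name `plot_count_batches`, its parameters, and the toy counts are assumptions for illustration and are not part of the original notebook.

```python
import matplotlib.pyplot as plt
import pandas as pd

def plot_count_batches(count_tables, hp_label, figsize=(18, 8)):
    """Draw one bar chart per (group name, counts Series) pair and return the figures."""
    figures = []
    for group_name, counts in count_tables.items():
        fig, ax = plt.subplots(figsize=figsize)
        counts.plot(kind="bar", ax=ax)
        ax.set_title(f"{group_name}: Count of {hp_label} HP for Specified Brand/Model of Vehicle")
        ax.set_xlabel("Brand, Model, HP")
        ax.set_ylabel("HP Count")
        ax.tick_params(axis="x", labelrotation=90)
        ax.grid()
        figures.append(fig)
    return figures

# Hypothetical toy counts shaped like the notebook's value_counts() output.
toy = pd.DataFrame({
    "brand": ["opel", "opel", "volkswagen"],
    "model": ["corsa", "corsa", "golf"],
    "power": [60, 60, 75],
})
toy_counts = toy[["brand", "model", "power"]].value_counts().sort_index()

figures = plot_count_batches({"Top 5": toy_counts}, "[60 & 75]")
```

In the notebook, one call per batch would be enough, for example `plot_count_batches({"Top 5": top5, "Middle": middle, "Lower": lower}, "[60 & 75]")` followed by `plt.show()`.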
audi75 = (df['brand'].isin(['audi'])) & (df['power'].isin([60,75])) & (df['model'].isna())
df.loc[audi75,['model']] = 'audi'
bmw75 = (df['brand'].isin(['bmw'])) & (df['power'].isin([60,75])) & (df['model'].isna())
df.loc[bmw75,['model']] = 'bmw'
opelsedan60 = (df['brand'].isin(['opel'])) & (df['power'].isin([60])) & (df['vehicletype'] == 'sedan') & (df['registrationyear'] < 1991) & (df['model'].isna())
df.loc[opelsedan60,['model']] = 'kadett'
opel9160 = (df['brand'].isin(['opel'])) & (df['power'].isin([60])) & ~(df['vehicletype'].isin(['wagon','small'])) & (df['registrationyear'] > 1990) & (df['registrationyear'] < 1992) & (df['model'].isna())
df.loc[opel9160,['model']] = 'kadett'
opelastra = (df['brand'].isin(['opel'])) & (df['vehicletype'] != 'small') & (df['power'].isin([60])) & (df['registrationyear'] > 1991) & (df['registrationyear'] < 1993)& (df['model'].isna())
df.loc[opelastra,['model']] = 'astra'
astraopel = (df['brand'].isin(['opel'])) & (df['vehicletype'] != 'small') & (df['power'].isin([60])) & (df['registrationyear'] > 1992) & (df['registrationyear'] < 2000) & (df['model'].isna())
df.loc[astraopel,['model']] = 'astra'
opelcorsa = (df['brand'].isin(['opel'])) & (df['vehicletype'] != 'bus') & (df['power'].isin([60])) & (df['model'].isna())
df.loc[opelcorsa,['model']] = 'corsa'
opelcombo = (df['brand'].isin(['opel'])) & (df['power'].isin([60])) & (df['model'].isna())
df.loc[opelcombo,['model']] = 'combo'
civic75 = (df['brand'].isin(['honda'])) & (df['power'].isin([60, 75])) & (df['model'].isna())
df.loc[civic75,['model']] = 'civic'
mini75 = (df['brand'].isin(['mini'])) & (df['power'].isin([60, 75])) & (df['model'].isna())
df.loc[mini75,['model']] = 'one'
nissan60 = (df['brand'].isin(['nissan'])) & (df['power'].isin([60])) & (df['model'].isna())
df.loc[nissan60,['model']] = 'micra'
seat60 = (df['brand'].isin(['seat'])) & (df['vehicletype'] != 'sedan') & (df['power'].isin([60])) & (df['model'].isna())
df.loc[seat60,['model']] = 'ibiza'
seatcordoba = (df['brand'].isin(['seat'])) & (df['power'].isin([60])) & (df['registrationyear'] == 1994) & (df['model'].isna())
df.loc[seatcordoba,['model']] = 'cordoba'
ibiza60 = (df['brand'].isin(['seat'])) & (df['power'].isin([60])) & (df['model'].isna())
df.loc[ibiza60,['model']] = 'ibiza'
cordoba93 = (df['brand'].isin(['seat'])) & (df['registrationyear'].isin([1993])) & (df['power'].isin([75])) & (df['model'].isna())
df.loc[cordoba93,['model']] = 'cordoba'
ibiza94 = (df['brand'].isin(['seat'])) & (df['registrationyear'].isin([1994])) & (df['power'].isin([75])) & (df['model'].isna())
df.loc[ibiza94,['model']] = 'ibiza'
cordoba97 = (df['brand'].isin(['seat'])) & (df['registrationyear'].isin([1997])) & (df['power'].isin([75])) & (df['model'].isna())
df.loc[cordoba97,['model']] = 'cordoba'
ibizasmall = (df['brand'].isin(['seat'])) & (df['vehicletype'] == 'small') & (df['registrationyear'].isin([1999])) & (df['power'].isin([75])) & (df['model'].isna())
df.loc[ibizasmall,['model']] = 'ibiza'
cordoba99 = (df['brand'].isin(['seat'])) & (df['registrationyear'].isin([1999])) & (df['power'].isin([75])) & (df['model'].isna())
df.loc[cordoba99,['model']] = 'cordoba'
ibiza03 = (df['brand'].isin(['seat'])) & (df['registrationyear'] > 2002) & (df['registrationyear'] < 2012) & (df['power'].isin([75])) & (df['model'].isna())
df.loc[ibiza03,['model']] = 'ibiza'
skoda60 = (df['brand'].isin(['skoda'])) & (df['registrationyear'] > 2000) & (df['registrationyear'] != 2013) & (df['power'].isin([60,75])) & (df['model'].isna())
df.loc[skoda60,['model']] = 'fabia'
lancia60 = (df['brand'].isin(['lancia'])) & (df['power'].isin([60])) & (df['model'].isna())
df.loc[lancia60,['model']] = 'ypsilon'
smart60 = (df['brand'].isin(['smart'])) & (df['power'].isin([60])) & (df['model'].isna())
df.loc[smart60,['model']] = 'fortwo'
smart75 = (df['brand'].isin(['smart'])) & (df['registrationyear'] > 2003) & (df['power'].isin([75])) & (df['model'].isna())
df.loc[smart75,['model']] = 'forfour'
bmw101 = (df['brand'].isin(['bmw'])) & (df['power'].isin([101])) & (df['model'].isna())
df.loc[bmw101,['model']] = '3er'
ford101 = (df['brand'].isin(['ford'])) & (df['power'].isin([101])) & (df['model'].isna())
df.loc[ford101,['model']] = 'focus'
chevy150 = (df['brand'].isin(['chevrolet'])) & (df['power'].isin([150])) & (df['model'].isna())
df.loc[chevy150,['model']] = 'other'
mit150 = (df['brand'].isin(['mitsubishi'])) & (df['registrationyear'] < 1994) & (df['power'].isin([150])) & (df['model'].isna())
df.loc[mit150,['model']] = 'other'
mitgalant = (df['brand'].isin(['mitsubishi'])) & (df['registrationyear'] == 1996) & (df['power'].isin([150])) & (df['model'].isna())
df.loc[mitgalant,['model']] = 'galant'
mit99 = (df['brand'].isin(['mitsubishi'])) & (df['vehicletype'] == 'sedan') & (df['registrationyear'] == 1999) & (df['power'].isin([150])) & (df['model'].isna())
df.loc[mit99,['model']] = 'galant'
mitbus = (df['brand'].isin(['mitsubishi'])) & (df['vehicletype'] == 'bus') & (df['registrationyear'] == 1999) & (df['power'].isin([150])) & (df['model'].isna())
df.loc[mitbus,['model']] = 'other'
mitbus00 = (df['brand'].isin(['mitsubishi'])) & (df['vehicletype'] == 'bus') & (df['power'].isin([150])) & (df['model'].isna())
df.loc[mitbus00,['model']] = 'other'
honda101 = (df['brand'].isin(['honda'])) & (df['power'].isin([101])) & (df['model'].isna())
df.loc[honda101,['model']] = 'civic'
honda150 = (df['brand'].isin(['honda'])) & (df['power'].isin([150])) & (df['model'].isna())
df.loc[honda150,['model']] = 'cr_reihe'
hondasuv = (df['brand'] == 'honda') & (df['model'].isna()) & (df['vehicletype'] == 'suv')
df.loc[hondasuv,['model']] = 'cr_reihe'
topbrand_vt = ['volkswagen']
vt_power = df[(df['brand'].notna()) & (df['brand'].isin(topbrand_vt)) & (df['model'].notna()) & (df['vehicletype'].notna())]
vwvt = vt_power[['vehicletype','model']].value_counts().sort_index()
vwvt.plot(kind = 'bar', figsize = (16,8))
plt.title("Volkswagen: Model & Vehicle Type Abundance")
plt.grid()
plt.show()
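The rules that follow were derived by reading charts like the one above and assigning the dominant model for a given vehicle type and power. As an alternative (a different technique from the hand-written masks, shown only for comparison), the most frequent known model per key combination can be looked up automatically. The helper name `fill_model_with_group_mode`, the `keys` list, and the toy data are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame using the notebook's lowercase column names.
toy = pd.DataFrame({
    "brand": ["volkswagen"] * 5,
    "vehicletype": ["sedan"] * 5,
    "power": [102, 102, 102, 102, 54],
    "model": ["golf", "golf", "passat", np.nan, np.nan],
})

keys = ["brand", "vehicletype", "power"]

def fill_model_with_group_mode(frame, keys):
    """Return a copy where missing models take the most frequent known model of their group."""
    known = frame.dropna(subset=["model"])
    # Most frequent known model per key combination (ties resolve alphabetically).
    lookup = (
        known.groupby(keys)["model"]
        .agg(lambda s: s.mode().iloc[0])
        .rename("model_guess")
        .reset_index()
    )
    # A left merge keeps the row order of `frame`, so guesses align by position.
    merged = frame.merge(lookup, on=keys, how="left")
    guesses = pd.Series(merged["model_guess"].to_numpy(), index=frame.index)
    filled = frame.copy()
    filled["model"] = frame["model"].fillna(guesses)
    return filled

filled = fill_model_with_group_mode(toy, keys)
print(filled)
```

Rows whose key combination has no known model (the 54 HP row here) stay missing, so manual rules like the ones below would still be needed for those cases.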
# VW GOLF
golf = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([211,230,174,102, 122, 350, 250, 170, 86, 200,100,109,190,68,80,72,131,144,129,77,160,76,204])) & (df['model'].isna())
df.loc[golf,['model']] = 'golf'
golf02 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([90])) & (df['registrationyear'] > 2002) & (df['model'].isna())
df.loc[golf02,['model']] = 'golf'
golf98 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([60])) & (df['registrationyear'] < 1998) & (df['model'].isna())
df.loc[golf98,['model']] = 'golf'
golf09 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([125,110])) & (df['registrationyear'] == 2009) & (df['model'].isna())
df.loc[golf09,['model']] = 'golf'
golf99 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([150])) & (df['registrationyear'] > 1999) & (df['model'].isna())
df.loc[golf99,['model']] = 'golf'
golf04 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([140])) & (df['registrationyear'] == 2004) & (df['model'].isna())
df.loc[golf04,['model']] = 'golf'
golf91 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([55])) & (df['registrationyear'] != 1991) & (df['model'].isna())
df.loc[golf91,['model']] = 'golf'
# VW POLO
polo = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([64,54])) & (df['model'].isna())
df.loc[polo,['model']] = 'polo'
polo98 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([60])) & (df['registrationyear'] > 1998) & (df['model'].isna())
df.loc[polo98,['model']] = 'polo'
# VW PASSAT
passat = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([148,136])) & (df['model'].isna())
df.loc[passat,['model']] = 'passat'
passat97 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([125])) & (df['registrationyear'] == 1997) & (df['model'].isna())
df.loc[passat97,['model']] = 'passat'
# VW BEETLE
beetle = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([30])) & (df['model'].isna())
df.loc[beetle,['model']] = 'beetle'
# VW JETTA
jetta = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([70])) & (df['registrationyear'] == 1981) & (df['model'].isna())
df.loc[jetta,['model']] = 'jetta'
# VW PHAETON
phaeton = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['power'].isin([313,420,240])) & (df['model'].isna())
df.loc[phaeton,['model']] = 'phaeton'
phaeton05 = (df['brand'] == 'volkswagen') & (df['vehicletype'] == 'sedan') & (df['registrationyear'] == 2005) & (df['power'].isin([224])) & (df['model'].isna())
df.loc[phaeton05,['model']] = 'phaeton'
trabant = (df['vehicletype'] == 'wagon') & (df['brand'] == 'trabant') & (df['model'].isna())
df.loc[trabant,['model']] = '601'
bmw = (df['vehicletype'] == 'wagon') & (df['registrationyear'] < 1990) & (df['brand'] == 'bmw') & (df['model'].isna())
df.loc[bmw,['model']] = '3er'
vw80 = (df['vehicletype'] == 'wagon') & (df['registrationyear'] < 1990) & (df['brand'] == 'volkswagen') & (df['model'].isna())
df.loc[vw80,['model']] = 'passat'
opel82 = (df['vehicletype'] == 'wagon') & (df['registrationyear'] == 1982) & (df['brand'] == 'opel') & (df['model'].isna())
df.loc[opel82,['model']] = 'kadett'
other82 = (df['vehicletype'] == 'wagon') & (df['registrationyear'] < 1988) & (df['brand'] != 'sonstige_autos') & (df['model'].isna())
df.loc[other82,['model']] = 'other'
volvo89 = (df['vehicletype'] == 'wagon') & (df['registrationyear'] == 1989) & (df['brand'] == 'volvo') & (df['model'].isna())
df.loc[volvo89,['model']] = 'other'
audi100 = (df['vehicletype'] == 'wagon') & (df['registrationyear'] == 1990) & (df['brand'] == 'audi') & (df['model'].isna())
df.loc[audi100,['model']] = '100'
freelander = (df['brand'] == 'land_rover') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([111,115,60,129,140,109,])) & (df['registrationyear'] > 1992) & (df['registrationyear'] < 2006) & (df['model'].isna())
df.loc[freelander,['model']] = 'freelander'
ypsilon = (df['brand'] == 'lancia') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([44,70,74,75,602,1200])) & (df['model'].isna())
df.loc[ypsilon,['model']] = 'ypsilon'
logan = (df['brand'] == 'dacia') & (df['vehicletype'].isin(['wagon'])) & (df['power'].isin([75,84,85,105])) & (df['registrationyear'].isin([2009,2012,2013,2015])) & (df['model'].isna())
df.loc[logan,['model']] = 'logan'
porscheother = (df['brand'] == 'porsche') & (df['vehicletype'] == 'coupe') & (df['power'].isin([125,160])) & (df['registrationyear'].isin([1981,1989])) & (df['model'].isna())
df.loc[porscheother,['model']] = 'other'
justy = (df['brand'] == 'subaru') & (df['vehicletype'] == 'small') & (df['power'].isin([25,34,50,60,68])) & (df['registrationyear'].isin([1996,1997,2000])) & (df['model'].isna())
df.loc[justy,['model']] = 'justy'
otherrover = (df['brand'] == 'rover') & (df['vehicletype'] == 'sedan') & (df['power'].isin([75,100,111,120,150,16,77,85,105,108,116,130,174])) & (df['registrationyear'].isin([1996,1997,1998,1999,2000,2001,2002,2003])) & (df['model'].isna())
df.loc[otherrover,['model']] = 'other'
chryslerother = (df['brand'] == 'chrysler') & (df['vehicletype'] == 'sedan') & (df['power'].isin([133,254,250,85,100,109,122,137,186])) & (df['registrationyear'].isin([1952,1977,1996,1998,1999,2000,2002,2008,2010])) & (df['model'].isna())
df.loc[chryslerother,['model']] = 'other'
voyager = (df['brand'] == 'chrysler') & (df['vehicletype'] == 'bus') & (df['power'].isin([151])) & (df['registrationyear'].isin([1996,1997,1999])) & (df['model'].isna())
df.loc[voyager,['model']] = 'voyager'
t601 = (df['brand'] == 'trabant') & (df['vehicletype'] == 'sedan') & (df['power'].isin([26,45])) & (df['registrationyear'].isin([1982,1988,1989,1977,1986,1984,1998])) & (df['model'].isna())
df.loc[t601,['model']] = '601'
six = (df['brand'] == 'trabant') & (df['vehicletype'].isin(['small','coupe'])) & (df['power'].isin([60,26,75])) & (df['registrationyear'].isin([1988,1998,2004,2008])) & (df['model'].isna())
df.loc[six,['model']] = '601'
otherchevy = (df['brand'] == 'chevrolet') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([64,141,75,94,95,54,109,125,195,163,130,105,124,72,69,60,360])) & (df['registrationyear'].isin([2011,2005,1968,1978,2000,2006,2010,2012])) & (df['model'].isna())
df.loc[otherchevy,['model']] = 'other'
volvoother = (df['brand'] == 'volvo') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([115,131,52,105,113,116])) & (df['registrationyear'].isin([1996,1991,1993,2007,1985,1988,1998,1999,2004,2012])) & (df['model'].isna())
df.loc[volvoother,['model']] = 'other'
kother = (df['brand'] == 'kia') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([105,138,140,48,101,113,133,143,203])) & (df['registrationyear'].isin([2005,2007,2001,2002,2003,2004])) & (df['model'].isna())
df.loc[kother,['model']] = 'other'
rio = (df['brand'] == 'kia') & (df['vehicletype'].isin(['wagon'])) & (df['power'].isin([97,109,83,98,105,125,138,139,150])) & (df['registrationyear'].isin([2003,2000,2007,1999,2001,2002])) & (df['model'].isna())
df.loc[rio,['model']] = 'rio'
sorento = (df['brand'] == 'kia') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([140,78,110,133,194])) & (df['registrationyear'].isin([2006,2001,2004,1995,1999,2012])) & (df['model'].isna())
df.loc[sorento,['model']] = 'sorento'
civic = (df['brand'] == 'honda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([90,124,125])) & (df['registrationyear'].isin([1992,1991,1993])) & (df['model'].isna())
df.loc[civic,['model']] = 'civic'
jazz = (df['brand'] == 'honda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([90])) & (df['registrationyear'].isin([2010])) & (df['model'].isna())
df.loc[jazz,['model']] = 'jazz'
hother = (df['brand'] == 'honda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([65])) & (df['registrationyear'].isin([1999])) & (df['model'].isna())
df.loc[hother,['model']] = 'other'
civcou = (df['brand'] == 'honda') & (df['vehicletype'].isin(['coupe'])) & (df['power'].isin([90,114,100,105,107,109])) & (df['registrationyear'].isin([2000,1995,1996,1998,1989,1999,2006])) & (df['model'].isna())
df.loc[civcou,['model']] = 'civic'
# NOTE: `honother` (power 133 or 185) is defined but never applied; only the power-133 rule (`cother`) below takes effect.
honother = (df['brand'] == 'honda') & (df['vehicletype'].isin(['coupe'])) & (df['power'].isin([133,185])) & (df['registrationyear'].isin([2000,1992,1998])) & (df['model'].isna())
cother = (df['brand'] == 'honda') & (df['vehicletype'].isin(['coupe'])) & (df['power'].isin([133])) & (df['registrationyear'].isin([2000,1992,1998])) & (df['model'].isna())
df.loc[cother,['model']] = 'other'
jbus = (df['brand'] == 'honda') & (df['vehicletype'].isin(['bus'])) & (df['power'].isin([90])) & (df['registrationyear'].isin([2010,2012,2013])) & (df['model'].isna())
df.loc[jbus,['model']] = 'jazz'
octavia = (df['brand'] == 'skoda') & (df['price'] > 2099) & (df['price'] < 5701) & (df['vehicletype'].isin(['wagon'])) & (df['power'].isin([102,105,150])) & (df['registrationyear'].isin([2001,2005,2007,2008])) & (df['model'].isna())
df.loc[octavia,['model']] = 'octavia'
swift = (df['brand'] == 'suzuki') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([53,50,55,58,92])) & (df['registrationyear'].isin([1997,2000,1998,2003,2008])) & (df['model'].isna())
df.loc[swift,['model']] = 'swift'
suzother = (df['brand'] == 'suzuki') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([63,52,65,76,83,84,57,96])) & (df['registrationyear'].isin([1990, 1995,1996,1999,2002,1997,2001,2004,2007])) & (df['model'].isna())
df.loc[suzother,['model']] = 'other'
ukiother = (df['brand'] == 'suzuki') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([68])) & (df['registrationyear'].isin([2009,2011,2012])) & (df['model'].isna())
df.loc[ukiother,['model']] = 'other'
jimny = (df['brand'] == 'suzuki') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([86,82,88])) & (df['registrationyear'].isin([2001,2005,2003])) & (df['model'].isna())
df.loc[jimny,['model']] = 'jimny'
zother = (df['brand'] == 'suzuki') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([97,45,68,75,85,98,136,170])) & (df['registrationyear'].isin([1995,1996,1988,1992,1998,2006,2007])) & (df['model'].isna())
df.loc[zother,['model']] = 'other'
carisma = (df['brand'] == 'mitsubishi') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([125,115])) & (df['registrationyear'].isin([2002,1995,1998,1997,2000,2003])) & (df['model'].isna())
df.loc[carisma,['model']] = 'carisma'
colt = (df['brand'] == 'mitsubishi') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([75,90,95])) & (df['registrationyear'].isin([2002,2009,2000,2006])) & (df['model'].isna())
df.loc[colt,['model']] = 'colt'
coltt = (df['brand'] == 'mitsubishi') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75,70,95,82,150])) & (df['registrationyear'].isin([1999,1996,1998,2006,2009,1997,2000,2002,2010,2012,2001])) & (df['model'].isna())
df.loc[coltt,['model']] = 'colt'
lancer = (df['brand'] == 'mitsubishi') & (df['vehicletype'].isin(['wagon'])) & (df['power'].isin([98])) & (df['registrationyear'].isin([1999,2000,2001,2003,2004,1997,2007])) & (df['model'].isna())
df.loc[lancer,['model']] = 'lancer'
galant = (df['brand'] == 'mitsubishi') & (df['vehicletype'].isin(['wagon'])) & (df['power'].isin([160,165])) & (df['registrationyear'].isin([1999,2000,2001,2003,2004,1997,2007])) & (df['model'].isna())
df.loc[galant,['model']] = 'galant'
wother = (df['brand'] == 'mitsubishi') & (df['vehicletype'].isin(['wagon'])) & (df['power'].isin([82,86,83,101,132,125])) & (df['registrationyear'].isin([1999,2000,2001,2003,2004])) & (df['model'].isna())
df.loc[wother,['model']] = 'other'
yaris = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75,86,90,87])) & (df['registrationyear'].isin([2008,2000,2001,2002])) & (df['model'].isna())
df.loc[yaris,['model']] = 'yaris'
aygo = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([68])) & (df['registrationyear'].isin([2008,2006,2009])) & (df['model'].isna())
df.loc[aygo,['model']] = 'aygo'
yar = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([68])) & (df['registrationyear'].isin([1999,2001])) & (df['model'].isna())
df.loc[yar,['model']] = 'yaris'
cor = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([97])) & (df['registrationyear'].isin([2003,2000,2001])) & (df['model'].isna())
df.loc[cor,['model']] = 'corolla'
corolla = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([86])) & (df['registrationyear'].isin([1998])) & (df['model'].isna())
df.loc[corolla,['model']] = 'corolla'
sixty = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60])) & (df['registrationyear'].isin([1997])) & (df['model'].isna())
df.loc[sixty,['model']] = 'other'
tother = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75])) & (df['registrationyear'].isin([1993,1995,1997])) & (df['model'].isna())
df.loc[tother,['model']] = 'other'
coro = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([75,88,90,110])) & (df['registrationyear'].isin([1993,2006,1995,2008])) & (df['model'].isna())
df.loc[coro,['model']] = 'corolla'
auris = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([177,124,126])) & (df['registrationyear'].isin([2007,2010])) & (df['model'].isna())
df.loc[auris,['model']] = 'auris'
llo = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([72,97,105])) & (df['registrationyear'].isin([1992,2003])) & (df['model'].isna())
df.loc[llo,['model']] = 'corolla'
avensis = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([177])) & (df['model'].isna())
df.loc[avensis,['model']] = 'avensis'
sedoy = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([63,91,180])) & (df['registrationyear'].isin([1993,1998,2009])) & (df['model'].isna())
df.loc[sedoy,['model']] = 'other'
yar = (df['brand'] == 'toyota') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([68])) & (df['registrationyear'].isin([1999])) & (df['model'].isna())
df.loc[yar,['model']] = 'yaris'
micra = (df['brand'] == 'nissan') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([54,65,50,55,40])) & (df['registrationyear'].isin([1994,2009,1998,1995,1999,2000,2004,1991,1996,1997,2008,2013])) & (df['model'].isna())
df.loc[micra,['model']] = 'micra'
micraa = (df['brand'] == 'nissan') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([65,80])) & (df['registrationyear'].isin([2003,2014])) & (df['model'].isna())
df.loc[micraa,['model']] = 'micra'
micraaa = (df['brand'] == 'nissan') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([98])) & (df['registrationyear'].isin([2013])) & (df['model'].isna())
df.loc[micraaa,['model']] = 'micra'
qashqai = (df['brand'] == 'nissan') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([150])) & (df['registrationyear'].isin([2011])) & (df['model'].isna())
df.loc[qashqai,['model']] = 'qashqai'
ibiza = (df['brand'] == 'seat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75,64,86,69,70,85])) & (df['registrationyear'].isin([2002,2001,2003,2011,2007])) & (df['model'].isna())
df.loc[ibiza,['model']] = 'ibiza'
arosa = (df['brand'] == 'seat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([50])) & (df['registrationyear'].isin([1999,2002,1998,2000,2001,1997])) & (df['model'].isna())
df.loc[arosa,['model']] = 'arosa'
ibizaa = (df['brand'] == 'seat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([86,101,69])) & (df['registrationyear'].isin([2006,2012,2013])) & (df['model'].isna())
df.loc[ibizaa,['model']] = 'ibiza'
ibiza1 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([200,51])) & (df['registrationyear'].isin([2009])) & (df['model'].isna())
df.loc[ibiza1,['model']] = 'ibiza'
other1 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([25])) & (df['model'].isna())
df.loc[other1,['model']] = 'other'
cordoba75 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([75])) & (df['registrationyear'].isin([1996,1998])) & (df['model'].isna())
df.loc[cordoba75,['model']] = 'cordoba'
leon07 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([105])) & (df['registrationyear'].isin([2007])) & (df['model'].isna())
df.loc[leon07,['model']] = 'leon'
leon160 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([102,160,265])) & (df['registrationyear'].isin([2007,2008,2009,2012])) & (df['model'].isna())
df.loc[leon160,['model']] = 'leon'
toledo = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([101,150])) & (df['registrationyear'].isin([1998,1999])) & (df['model'].isna())
df.loc[toledo,['model']] = 'toledo'
leon140 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([140])) & (df['registrationyear'].isin([2006])) & (df['model'].isna())
df.loc[leon140,['model']] = 'leon'
toledo150 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([150])) & (df['registrationyear'].isin([2000])) & (df['model'].isna())
df.loc[toledo150,['model']] = 'toledo'
ibiza09 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([86])) & (df['registrationyear'].isin([2009])) & (df['model'].isna())
df.loc[ibiza09,['model']] = 'ibiza'
ibiza07 = (df['brand'] == 'seat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([64])) & (df['model'].isna())
df.loc[ibiza07,['model']] = 'ibiza'
getz = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([55,82,88,97])) & (df['registrationyear'].isin([2003,2007,2002])) & (df['model'].isna())
df.loc[getz,['model']] = 'getz'
i_reihe = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([68,77,78])) & (df['registrationyear'].isin([2010,2009,2011,2007])) & (df['model'].isna())
df.loc[i_reihe,['model']] = 'i_reihe'
getz03 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([55,63,67,65,90])) & (df['registrationyear'].isin([2003])) & (df['model'].isna())
df.loc[getz03,['model']] = 'getz'
yother = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([58,54,55,60,75,40])) & (df['registrationyear'].isin([1998,1999,1996,2000,2001,2002])) & (df['model'].isna())
df.loc[yother,['model']] = 'other'
yot = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60])) & (df['model'].isna())
df.loc[yot,['model']] = 'other'
ir = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([67])) & (df['registrationyear'].isin([2010])) & (df['model'].isna())
df.loc[ir,['model']] = 'i_reihe'
other58 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([58])) & (df['model'].isna())
df.loc[other58,['model']] = 'other'
i = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([63,65,79,90])) & (df['registrationyear'].isin([2011])) & (df['model'].isna())
df.loc[i,['model']] = 'i_reihe'
rei = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([109,90,78])) & (df['registrationyear'].isin([2009,2010,2011])) & (df['model'].isna())
df.loc[rei,['model']] = 'i_reihe'
other99 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([82,140,160,235])) & (df['registrationyear'].isin([1999,2003,2005,2006])) & (df['model'].isna())
df.loc[other99,['model']] = 'other'
other94 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([75,85,86,131,136,54])) & (df['registrationyear'].isin([1994,2000,2001,2002,2005])) & (df['model'].isna())
df.loc[other94,['model']] = 'other'
santa = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([145,155,170])) & (df['registrationyear'].isin([2003,2002,2004,2006,2008])) & (df['model'].isna())
df.loc[santa,['model']] = 'santa'
he = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([163,140,184])) & (df['registrationyear'].isin([2010,2013])) & (df['model'].isna())
df.loc[he,['model']] = 'i_reihe'
shother = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([163,99])) & (df['registrationyear'].isin([2006,1998,2000,2005])) & (df['model'].isna())
df.loc[shother,['model']] = 'other'
santa140 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([140])) & (df['registrationyear'].isin([2003])) & (df['model'].isna())
df.loc[santa140,['model']] = 'santa'
other150 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([150])) & (df['registrationyear'].isin([2002,2003])) & (df['model'].isna())
df.loc[other150,['model']] = 'other'
santa06 = (df['brand'] == 'hyundai') & (df['vehicletype'].isin(['suv'])) & (df['power'].isin([150])) & (df['registrationyear'].isin([2006])) & (df['model'].isna())
df.loc[santa06,['model']] = 'santa'
c1 = (df['brand'] == 'citroen') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([68])) & (df['registrationyear'].isin([2008,2011])) & (df['model'].isna())
df.loc[c1,['model']] = 'c1'
c3 = (df['brand'] == 'citroen') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([73])) & (df['registrationyear'].isin([2003])) & (df['model'].isna())
df.loc[c3,['model']] = 'c3'
othercit = (df['brand'] == 'citroen') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60,75])) & (df['registrationyear'].isin([2001,1999,2000,1998])) & (df['model'].isna())
df.loc[othercit,['model']] = 'other'
fortwo = (df['brand'] == 'smart') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([61,45,54,41,55,71,40,50,72])) & (df['registrationyear'].isin([2005,2002,1999,2001,2000,2012,2004,2003,2008,1998,2007,2011,2009,2014])) & (df['model'].isna())
df.loc[fortwo,['model']] = 'fortwo'
forfour = (df['brand'] == 'smart') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([109])) & (df['registrationyear'].isin([2006])) & (df['model'].isna())
df.loc[forfour,['model']] = 'forfour'
ftvert = (df['brand'] == 'smart') & (df['vehicletype'].isin(['convertible'])) & (df['power'].isin([54,41,45])) & (df['registrationyear'].isin([2000,2001,2005,2006,2008])) & (df['model'].isna())
df.loc[ftvert,['model']] = 'fortwo'
vertft = (df['brand'] == 'smart') & (df['vehicletype'].isin(['convertible'])) & (df['power'].isin([55,61])) & (df['registrationyear'].isin([2000,2001,2002])) & (df['model'].isna())
df.loc[vertft,['model']] = 'fortwo'
ft = (df['brand'] == 'smart') & (df['vehicletype'].isin(['convertible'])) & (df['power'].isin([84])) & (df['registrationyear'].isin([2009])) & (df['model'].isna())
df.loc[ft,['model']] = 'fortwo'
sixre = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([115,116,166,141,120])) & (df['registrationyear'].isin([1999,2003])) & (df['model'].isna())
df.loc[sixre,['model']] = '6_reihe'
sre = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([90])) & (df['registrationyear'].isin([1997,1998,1996,2000,1990])) & (df['model'].isna())
df.loc[sre,['model']] = '6_reihe'
mazother = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([144,163,109])) & (df['registrationyear'].isin([1997,2001,1993])) & (df['model'].isna())
df.loc[mazother,['model']] = 'other'
three = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([98])) & (df['registrationyear'].isin([2001])) & (df['model'].isna())
df.loc[three,['model']] = '3_reihe'
three88 = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([88])) & (df['registrationyear'].isin([1997,1995,1998,1996])) & (df['model'].isna())
df.loc[three88,['model']] = '3_reihe'
rh6 = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([166,141,163])) & (df['registrationyear'].isin([2002,2010])) & (df['model'].isna())
df.loc[rh6,['model']] = '6_reihe'
thei = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([105,73])) & (df['registrationyear'].isin([1996,2006,1997,2005,2008])) & (df['model'].isna())
df.loc[thei,['model']] = '3_reihe'
ihth = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([144,98,114,150,109,86,75])) & (df['registrationyear'].isin([1999,1995,2003,2006,2000,2010])) & (df['model'].isna())
df.loc[ihth,['model']] = '3_reihe'
eeh = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([73])) & (df['registrationyear'].isin([1997,1996,2000])) & (df['model'].isna())
df.loc[eeh,['model']] = '3_reihe'
hee = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75])) & (df['registrationyear'].isin([1997,1999])) & (df['model'].isna())
df.loc[hee,['model']] = '3_reihe'
ri3 = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([88,98,65,109])) & (df['registrationyear'].isin([1998,1999,1996,2002,2003,2006,2008])) & (df['model'].isna())
df.loc[ri3,['model']] = '3_reihe'
reihe373 = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([73])) & (df['registrationyear'].isin([1998])) & (df['model'].isna())
df.loc[reihe373,['model']] = '3_reihe'
other7509 = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75])) & (df['registrationyear'].isin([2009,2013])) & (df['model'].isna())
df.loc[other7509,['model']] = 'other'
reihe1 = (df['brand'] == 'mazda') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([75])) & (df['registrationyear'].isin([1995])) & (df['model'].isna())
df.loc[reihe1,['model']] = '1_reihe'
punto60 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60])) & (df['registrationyear'].isin([2000.0, 2001.0, 2002.0, 2003.0, 1999.0, 1998.0,
1997.0, 1996.0, 1993.0, 1994.0])) & (df['model'].isna())
df.loc[punto60,['model']] = 'punto'
panda60 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60])) & (df['registrationyear'].isin([2010.0, 2008.0, 2011.0, 1991.0])) & (df['model'].isna())
df.loc[panda60,['model']] = 'panda'
seicento60 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([55])) & (df['registrationyear'].isin([2000,2001])) & (df['model'].isna())
df.loc[seicento60,['model']] = 'seicento'
punto65 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([65])) & (df['registrationyear'].isin([2010.0, 2000.0, 1999.0,
1996.0, 1998.0, 2001.0, 2003.0, 2004.0])) & (df['model'].isna())
df.loc[punto65,['model']] = 'punto'
punto01 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60, 80, 44, 75, 90, 65, 85, 64, 68, 86])) & (df['registrationyear'].isin([2001])) & (df['model'].isna())
df.loc[punto01,['model']] = 'punto'
seicento01 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([55,50])) & (df['registrationyear'].isin([2001])) & (df['model'].isna())
df.loc[seicento01,['model']] = 'seicento'
stilo170 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([170])) & (df['registrationyear'].isin([2001])) & (df['model'].isna())
df.loc[stilo170,['model']] = 'stilo'
other101 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([101])) & (df['registrationyear'].isin([2001])) & (df['model'].isna())
df.loc[other101,['model']] = 'other'
punto98 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([60, 86, 65, 75, 44])) & (df['registrationyear'].isin([1998])) & (df['model'].isna())
df.loc[punto98,['model']] = 'punto'
five69 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([69])) & (df['registrationyear'].isin([2008.0, 2009.0, 2010.0, 2013.0])) & (df['model'].isna())
df.loc[five69,['model']] = '500'
puntorand = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['small'])) & (df['power'].isin([80, 86, 85, 69, 64])) & (df['registrationyear'].isin([1999,2003,2000])) & (df['model'].isna())
df.loc[puntorand,['model']] = 'punto'
stilo103 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([103, 80, 170, 115, 102])) & (df['registrationyear'].isin([2002.0, 2003.0, 2004.0, 2005.0])) & (df['model'].isna())
df.loc[stilo103,['model']] = 'stilo'
bravo150 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([150])) & (df['registrationyear'].isin([2007.0, 2008.0])) & (df['model'].isna())
df.loc[bravo150,['model']] = 'bravo'
bravo08 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([60])) & (df['registrationyear'].isin([2008.0])) & (df['model'].isna())
df.loc[bravo08,['model']] = 'bravo'
punto60 = (df['brand'] == 'fiat') & (df['vehicletype'].isin(['sedan'])) & (df['power'].isin([60])) & (df['registrationyear'].isin([2003,2000])) & (df['model'].isna())
df.loc[punto60,['model']] = 'punto'
re2 = (df['brand'] == 'peugeot') & (df['vehicletype'].isin(['small']))& (df['power'].isin([60])) & (df['registrationyear'].isin([2004.0, 2005.0,
2011.0, 2010.0, 1990.0])) & (df['model'].isna())
df.loc[re2,['model']] = '2_reihe'
twore = (df['brand'] == 'peugeot') & (df['vehicletype'].isin(['convertible']))& (df['power'].isin([120,109])) & (df['registrationyear'].isin([2003.0, 2002.0, 2004.0, 2005.0, 2011.0, 2012.0])) & (df['model'].isna())
df.loc[twore,['model']] = '2_reihe'
fiestarand = (df['brand'] == 'ford') & (df['vehicletype'].isin(['small']))& (df['power'].isin([82, 150,182, 61,81])) & (df['registrationyear'].isin([2006.0, 2009.0, 2014.0, 2000.0, 2005.0])) & (df['model'].isna())
df.loc[fiestarand,['model']] = 'fiesta'
fiestaa = (df['brand'] == 'ford') & (df['vehicletype'].isin(['small']))& (df['power'].isin([60])) & (df['registrationyear'].isin([1992.0, 2010.0])) & (df['model'].isna())
df.loc[fiestaa,['model']] = 'fiesta'
fiestaaa = (df['brand'] == 'ford') & (df['vehicletype'].isin(['small']))& (df['power'].isin([50, 75, 54, 66, 103])) & (df['registrationyear'].isin([2000])) & (df['model'].isna())
df.loc[fiestaaa,['model']] = 'fiesta'
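The rules above all share one shape: build a mask from brand, vehicle type, power, and registration year, require a missing model, then assign a model name. As a minimal sketch, the same pattern could be driven from a rule table. `RULES` and `apply_model_rules` are hypothetical names introduced here for illustration, the toy frame is made up, and only two of the rules above (Honda Jazz, Nissan Qashqai) are ported.

```python
import numpy as np
import pandas as pd

# Hypothetical rule table: (column constraints, model to assign).
# Only two of the hand-written rules are ported here as an illustration.
RULES = [
    ({"brand": ["honda"], "vehicletype": ["small"],
      "power": [90], "registrationyear": [2010]}, "jazz"),
    ({"brand": ["nissan"], "vehicletype": ["suv"],
      "power": [150], "registrationyear": [2011]}, "qashqai"),
]

def apply_model_rules(df, rules):
    """Fill missing 'model' values rule by rule, in the order given."""
    df = df.copy()
    for conditions, model in rules:
        # Start from rows whose model is still missing
        mask = df["model"].isna()
        for column, allowed in conditions.items():
            mask = mask & df[column].isin(allowed)
        df.loc[mask, "model"] = model
    return df

# Toy data (made up) to show that known models are left untouched
toy = pd.DataFrame({
    "brand": ["honda", "honda", "nissan"],
    "vehicletype": ["small", "small", "suv"],
    "power": [90, 90, 150],
    "registrationyear": [2010, 2010, 2011],
    "model": [np.nan, "civic", np.nan],
})
toy_filled = apply_model_rules(toy, RULES)
print(toy_filled["model"].tolist())
```

Because every rule re-checks `isna()` at the moment it runs, earlier rules take precedence over later ones, which matches the order-dependent behavior of the hand-written assignments.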
del brand_power1, brand_power2, brand_power3, brand_power4, brand_power5, brand_power6
del top5_brand_power, over1000_brand_power, under1000_brand_power
del top5_brands, top5_brands2, top5_brands3, top5_brands4, top5_brands5, top5_brands6
del middle_brands, middle_brands2, middle_brands3, middle_brands4, middle_brands5, middle_brands6
del lower_brands, lower_brands2, lower_brands3, lower_brands4, lower_brands5, lower_brands6
del top5, top52, top53, top54, top55, top56
del middle, middle2, middle3, middle4, middle5, middle6
del lower, lower2, lower3, lower4, lower5, lower6
def fill_missing_models(df):
    df = df.copy()
    # Split data into known and missing model subsets
    known = df[df['model'].notna()]
    missing = df[df['model'].isna()]
    # --- Step 1: Keep only combinations that map to exactly one model ---
    unique_models = (
        known.groupby(['brand', 'vehicletype', 'power', 'registrationyear'])['model']
        .nunique()
        .reset_index(name='model_count')
    )
    # Only combos with one unique model (avoid ambiguous mappings)
    unique_keys = unique_models[unique_models['model_count'] == 1].drop(columns='model_count')
    # Merge these unique combos with their actual model name
    unique_known = known.merge(unique_keys, on=['brand', 'vehicletype', 'power', 'registrationyear'])
    unique_known = unique_known[['brand', 'vehicletype', 'power', 'registrationyear', 'model']].drop_duplicates()
    # --- Step 2: Merge to fill missing models safely ---
    filled = missing.merge(
        unique_known,
        on=['brand', 'vehicletype', 'power', 'registrationyear'],
        how='left',
        suffixes=('', '_known')
    )
    # Fill in model from the unique match
    filled['model'] = filled['model_known'].combine_first(filled['model'])
    filled = filled.drop(columns=['model_known'])
    # --- Step 3: Combine back with known data ---
    # Note: rows are reordered (known first, then filled) and the index is reset
    result = pd.concat([known, filled], ignore_index=True)
    return result
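To see why ambiguous key combinations are skipped, the idea can be checked on a toy frame. This is a minimal sketch: `fill_by_unique_key` is a hypothetical compact stand-in written for this example, not the function above, and the data is made up. Unlike `fill_missing_models`, it keeps the original row order.

```python
import numpy as np
import pandas as pd

KEYS = ["brand", "vehicletype", "power", "registrationyear"]

def fill_by_unique_key(df, keys=KEYS):
    """Fill 'model' only where the key combination maps to exactly one known model."""
    known = df[df["model"].notna()]
    # Per key combination: how many distinct models, and the first one seen
    per_key = known.groupby(keys)["model"].agg(["nunique", "first"]).reset_index()
    # Keep unambiguous combinations only
    lookup = (
        per_key[per_key["nunique"] == 1]
        .drop(columns="nunique")
        .rename(columns={"first": "model_known"})
    )
    merged = df.merge(lookup, on=keys, how="left")
    merged["model"] = merged["model"].fillna(merged["model_known"])
    return merged.drop(columns="model_known")

# Toy data (made up): the Fiat key is unambiguous, the Ford key is not
toy = pd.DataFrame({
    "brand": ["fiat", "fiat", "fiat", "ford", "ford", "ford"],
    "vehicletype": ["small"] * 6,
    "power": [60] * 6,
    "registrationyear": [2000] * 6,
    "model": ["punto", "punto", np.nan, "fiesta", "ka", np.nan],
})
result = fill_by_unique_key(toy)
print(result["model"].tolist())
```

The missing Fiat row is filled with `punto`, while the missing Ford row stays empty because the same key matches both `fiesta` and `ka`. This is the trade-off behind the remaining missing models in the real data.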
df_new = df.copy()
df_new = fill_missing_models(df_new)
df_new.isna().sum()
datecrawled                    0
price                          0
vehicletype                37471
registrationyear               0
gearbox                    19830
power                          0
model                      15662
mileage                        0
registrationmonth              0
fueltype                   32889
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
dtype: int64
df_new.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 354107 entries, 0 to 354106
Data columns (total 17 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   datecrawled              354107 non-null  object
 1   price                    354107 non-null  int64
 2   vehicletype              316636 non-null  object
 3   registrationyear         354107 non-null  float64
 4   gearbox                  334277 non-null  object
 5   power                    354107 non-null  int64
 6   model                    338445 non-null  object
 7   mileage                  354107 non-null  int64
 8   registrationmonth        354107 non-null  int64
 9   fueltype                 321218 non-null  object
 10  brand                    354107 non-null  object
 11  notrepaired              282962 non-null  object
 12  datecreated              354107 non-null  datetime64[ns]
 13  numberofpictures         354107 non-null  int64
 14  postalcode               354107 non-null  int64
 15  lastseen                 354107 non-null  object
 16  registration_correction  354107 non-null  object
dtypes: datetime64[ns](1), float64(1), int64(6), object(9)
memory usage: 45.9+ MB
def analyze_missing_models(df, brand):
    # Focus on the brand
    brand_df = df[df['brand'] == brand]
    # Step 1: Check which vehicle types are most common for missing models
    vt_counts = brand_df[brand_df['model'].isna()]['vehicletype'].value_counts()
    print(f"\n--- {brand.upper()} ---")
    print("Vehicle types with missing models:")
    print(vt_counts)
    # Step 2: For each vehicle type, show power distribution
    for vt in vt_counts.index:
        subset = brand_df[(brand_df['model'].isna()) & (brand_df['vehicletype'] == vt)]
        pw_counts = subset['power'].value_counts()
        print(f"\n{vt}: Power distribution for missing models")
        print(pw_counts)
        print(pw_counts.index)
        # Step 3: Show registration year distribution
        reg_counts = subset['registrationyear'].value_counts()
        print(f"\n{vt}: Registration year distribution for missing models")
        print(reg_counts)
        print(reg_counts.index)
analyze_missing_models(df_new, 'ford')
--- FORD ---
Vehicle types with missing models:
small 198
wagon 101
sedan 69
bus 43
coupe 30
suv 13
other 13
convertible 9
Name: vehicletype, dtype: int64
small: Power distribution for missing models
0 67
60 60
50 16
75 16
90 5
80 5
55 4
44 3
70 3
45 3
68 2
65 2
100 2
69 1
67 1
71 1
74 1
59 1
95 1
96 1
110 1
116 1
118 1
Name: power, dtype: int64
Int64Index([ 0, 60, 50, 75, 90, 80, 55, 44, 70, 45, 68, 65, 100,
69, 67, 71, 74, 59, 95, 96, 110, 116, 118],
dtype='int64')
small: Registration year distribution for missing models
1999.0 30
1998.0 25
1997.0 23
2000.0 21
2002.0 21
2001.0 17
2004.0 12
1996.0 11
2003.0 11
2005.0 8
2006.0 5
1990.0 5
2007.0 3
1978.0 1
2014.0 1
2009.0 1
2008.0 1
1994.0 1
1992.0 1
Name: registrationyear, dtype: int64
Float64Index([1999.0, 1998.0, 1997.0, 2000.0, 2002.0, 2001.0, 2004.0, 1996.0,
2003.0, 2005.0, 2006.0, 1990.0, 2007.0, 1978.0, 2014.0, 2009.0,
2008.0, 1994.0, 1992.0],
dtype='float64')
wagon: Power distribution for missing models
0 36
115 14
116 11
90 9
131 7
109 4
100 4
120 3
75 3
117 2
105 1
128 1
89 1
60 1
170 1
150 1
140 1
125 1
Name: power, dtype: int64
Int64Index([ 0, 115, 116, 90, 131, 109, 100, 120, 75, 117, 105, 128, 89,
60, 170, 150, 140, 125],
dtype='int64')
wagon: Registration year distribution for missing models
2001.0 13
1998.0 12
2000.0 12
1999.0 12
2005.0 11
2002.0 9
2003.0 8
2004.0 7
1997.0 4
1996.0 4
2006.0 3
2008.0 2
2007.0 2
1995.0 1
1990.0 1
Name: registrationyear, dtype: int64
Float64Index([2001.0, 1998.0, 2000.0, 1999.0, 2005.0, 2002.0, 2003.0, 2004.0,
1997.0, 1996.0, 2006.0, 2008.0, 2007.0, 1995.0, 1990.0],
dtype='float64')
sedan: Power distribution for missing models
0 19
75 7
90 7
116 5
115 4
110 3
95 2
50 2
77 2
226 2
148 1
109 1
1002 1
105 1
230 1
1120 1
94 1
29 1
136 1
131 1
85 1
145 1
66 1
170 1
38 1
89 1
Name: power, dtype: int64
Int64Index([ 0, 75, 90, 116, 115, 110, 95, 50, 77, 226, 148,
109, 1002, 105, 230, 1120, 94, 29, 136, 131, 85, 145,
66, 170, 38, 89],
dtype='int64')
sedan: Registration year distribution for missing models
1998.0 10
1999.0 8
1997.0 6
2000.0 5
1996.0 5
2001.0 4
2002.0 4
2006.0 3
2005.0 3
1995.0 2
1993.0 2
2009.0 2
1989.0 2
2013.0 1
1976.0 1
1960.0 1
1940.0 1
1970.0 1
2007.0 1
1994.0 1
1977.0 1
2004.0 1
1988.0 1
1978.0 1
1979.0 1
1967.0 1
Name: registrationyear, dtype: int64
Float64Index([1998.0, 1999.0, 1997.0, 2000.0, 1996.0, 2001.0, 2002.0, 2006.0,
2005.0, 1995.0, 1993.0, 2009.0, 1989.0, 2013.0, 1976.0, 1960.0,
1940.0, 1970.0, 2007.0, 1994.0, 1977.0, 2004.0, 1988.0, 1978.0,
1979.0, 1967.0],
dtype='float64')
bus: Power distribution for missing models
0 15
116 6
125 4
75 2
140 2
90 2
131 2
135 1
147 1
80 1
130 1
146 1
211 1
128 1
98 1
175 1
145 1
Name: power, dtype: int64
Int64Index([0, 116, 125, 75, 140, 90, 131, 135, 147, 80, 130, 146, 211, 128,
98, 175, 145],
dtype='int64')
bus: Registration year distribution for missing models
2005.0 8
2001.0 5
2009.0 4
1998.0 4
2006.0 3
2003.0 3
1999.0 3
2008.0 2
2007.0 2
1997.0 2
1996.0 2
2000.0 2
1993.0 1
1992.0 1
2004.0 1
Name: registrationyear, dtype: int64
Float64Index([2005.0, 2001.0, 2009.0, 1998.0, 2006.0, 2003.0, 1999.0, 2008.0,
2007.0, 1997.0, 1996.0, 2000.0, 1993.0, 1992.0, 2004.0],
dtype='float64')
coupe: Power distribution for missing models
0 10
130 5
131 2
90 2
100 1
69 1
136 1
138 1
140 1
132 1
179 1
145 1
120 1
122 1
125 1
Name: power, dtype: int64
Int64Index([0, 130, 131, 90, 100, 69, 136, 138, 140, 132, 179, 145, 120, 122,
125],
dtype='int64')
coupe: Registration year distribution for missing models
2002.0 7
2000.0 7
2001.0 3
1999.0 3
1995.0 2
2006.0 2
1998.0 2
1980.0 1
1997.0 1
1978.0 1
2009.0 1
Name: registrationyear, dtype: int64
Float64Index([2002.0, 2000.0, 2001.0, 1999.0, 1995.0, 2006.0, 1998.0, 1980.0,
1997.0, 1978.0, 2009.0],
dtype='float64')
suv: Power distribution for missing models
124 4
0 3
150 1
340 1
165 1
196 1
203 1
125 1
Name: power, dtype: int64
Int64Index([124, 0, 150, 340, 165, 196, 203, 125], dtype='int64')
suv: Registration year distribution for missing models
1994.0 5
2009.0 1
1987.0 1
2001.0 1
2003.0 1
1977.0 1
2004.0 1
1989.0 1
2006.0 1
Name: registrationyear, dtype: int64
Float64Index([1994.0, 2009.0, 1987.0, 2001.0, 2003.0, 1977.0, 2004.0, 1989.0,
2006.0],
dtype='float64')
other: Power distribution for missing models
0 3
157 2
226 1
70 1
80 1
205 1
240 1
115 1
109 1
175 1
Name: power, dtype: int64
Int64Index([0, 157, 226, 70, 80, 205, 240, 115, 109, 175], dtype='int64')
other: Registration year distribution for missing models
1993.0 2
1984.0 2
1964.0 1
2008.0 1
1959.0 1
2001.0 1
1953.0 1
1996.0 1
2000.0 1
2005.0 1
2006.0 1
Name: registrationyear, dtype: int64
Float64Index([1993.0, 1984.0, 1964.0, 2008.0, 1959.0, 2001.0, 1953.0, 1996.0,
2000.0, 2005.0, 2006.0],
dtype='float64')
convertible: Power distribution for missing models
95 3
90 2
0 1
116 1
70 1
190 1
Name: power, dtype: int64
Int64Index([95, 90, 0, 116, 70, 190], dtype='int64')
convertible: Registration year distribution for missing models
2004.0 4
1997.0 1
2003.0 1
1999.0 1
1996.0 1
1992.0 1
Name: registrationyear, dtype: int64
Float64Index([2004.0, 1997.0, 2003.0, 1999.0, 1996.0, 1992.0], dtype='float64')
def analyze_missing_models(df, brand):
    # Focus on the brand
    brand_df = df[df['brand'] == brand]
    # Step 1: Check which vehicle types are most common for missing models
    vt_counts = brand_df[brand_df['model'].isna()]['vehicletype'].value_counts()
    print(f"\n--- {brand.upper()} ---")
    print("Vehicle types with missing models:")
    print(vt_counts)
    # Step 2: For each vehicle type, show power distribution
    for vt in vt_counts.index:
        subset = brand_df[(brand_df['model'].isna()) & (brand_df['vehicletype'] == vt)]
        pw_counts = subset['power'].value_counts()
        print(f"\n{vt}: Power distribution for missing models")
        print(pw_counts)
        print(pw_counts.index)
        # Step 3: Show registration year distribution
        reg_counts = subset['registrationyear'].value_counts()
        print(f"\n{vt}: Registration year distribution for missing models")
        print(reg_counts)
        print(reg_counts.index)
analyze_missing_models(df_new, 'mercedes_benz')
--- MERCEDES_BENZ ---
Vehicle types with missing models:
sedan 315
wagon 136
coupe 64
bus 37
convertible 24
other 16
suv 15
small 14
Name: vehicletype, dtype: int64
sedan: Power distribution for missing models
0 93
136 23
122 21
170 16
150 15
224 12
143 11
204 10
163 10
75 8
109 8
160 6
306 6
125 6
184 5
177 5
193 4
116 4
118 3
102 3
197 3
95 3
132 3
108 2
190 2
220 2
90 2
87 2
65 2
272 2
265 1
234 1
300 1
278 1
218 1
387 1
388 1
156 1
186 1
16051 1
174 1
166 1
161 1
52 1
142 1
140 1
123 1
110 1
107 1
103 1
88 1
86 1
10912 1
Name: power, dtype: int64
Int64Index([ 0, 136, 122, 170, 150, 224, 143, 204, 163,
75, 109, 160, 306, 125, 184, 177, 193, 116,
118, 102, 197, 95, 132, 108, 190, 220, 90,
87, 65, 272, 265, 234, 300, 278, 218, 387,
388, 156, 186, 16051, 174, 166, 161, 52, 142,
140, 123, 110, 107, 103, 88, 86, 10912],
dtype='int64')
sedan: Registration year distribution for missing models
2002.0 26
1999.0 20
2000.0 20
1996.0 19
2001.0 19
2003.0 17
1992.0 17
1990.0 13
1997.0 13
1998.0 13
1991.0 13
2005.0 12
2007.0 11
1989.0 10
2008.0 10
1995.0 10
2006.0 9
1993.0 8
2004.0 7
1982.0 7
1994.0 7
1987.0 5
1986.0 5
1988.0 4
2010.0 4
1983.0 4
1985.0 2
1981.0 2
1968.0 1
1967.0 1
1974.0 1
1966.0 1
2012.0 1
1976.0 1
2009.0 1
1971.0 1
Name: registrationyear, dtype: int64
Float64Index([2002.0, 1999.0, 2000.0, 1996.0, 2001.0, 2003.0, 1992.0, 1990.0,
1997.0, 1998.0, 1991.0, 2005.0, 2007.0, 1989.0, 2008.0, 1995.0,
2006.0, 1993.0, 2004.0, 1982.0, 1994.0, 1987.0, 1986.0, 1988.0,
2010.0, 1983.0, 1985.0, 1981.0, 1968.0, 1967.0, 1974.0, 1966.0,
2012.0, 1976.0, 2009.0, 1971.0],
dtype='float64')
wagon: Power distribution for missing models
0 34
150 16
122 14
170 11
136 8
163 8
116 7
204 5
143 5
125 5
90 4
224 4
130 2
177 2
193 2
132 1
272 1
115 1
102 1
165 1
184 1
196 1
280 1
205 1
Name: power, dtype: int64
Int64Index([ 0, 150, 122, 170, 136, 163, 116, 204, 143, 125, 90, 224, 130,
177, 193, 132, 272, 115, 102, 165, 184, 196, 280, 205],
dtype='int64')
wagon: Registration year distribution for missing models
1997.0 17
1998.0 15
2003.0 14
2002.0 12
2008.0 10
2000.0 8
1999.0 8
2001.0 7
1996.0 6
2004.0 6
2006.0 5
1989.0 5
2005.0 4
2010.0 4
1994.0 3
1993.0 3
1992.0 2
1991.0 2
2007.0 2
1995.0 2
2009.0 1
Name: registrationyear, dtype: int64
Float64Index([1997.0, 1998.0, 2003.0, 2002.0, 2008.0, 2000.0, 1999.0, 2001.0,
1996.0, 2004.0, 2006.0, 1989.0, 2005.0, 2010.0, 1994.0, 1993.0,
1992.0, 1991.0, 2007.0, 1995.0, 2009.0],
dtype='float64')
coupe: Power distribution for missing models
0 9
163 8
136 5
306 5
170 5
197 4
200 3
272 3
192 2
305 2
109 2
231 2
224 2
218 2
143 2
132 2
186 1
193 1
150 1
208 1
500 1
122 1
Name: power, dtype: int64
Int64Index([ 0, 163, 136, 306, 170, 197, 200, 272, 192, 305, 109, 231, 224,
218, 143, 132, 186, 193, 150, 208, 500, 122],
dtype='int64')
coupe: Registration year distribution for missing models
2002.0 13
2000.0 6
2004.0 5
2001.0 5
2005.0 4
2006.0 4
2003.0 3
1999.0 3
1982.0 3
1998.0 3
1978.0 2
2007.0 2
1988.0 2
1991.0 1
2008.0 1
1997.0 1
2010.0 1
1972.0 1
1995.0 1
1990.0 1
1992.0 1
1984.0 1
Name: registrationyear, dtype: int64
Float64Index([2002.0, 2000.0, 2004.0, 2001.0, 2005.0, 2006.0, 2003.0, 1999.0,
1982.0, 1998.0, 1978.0, 2007.0, 1988.0, 1991.0, 2008.0, 1997.0,
2010.0, 1972.0, 1995.0, 1990.0, 1992.0, 1984.0],
dtype='float64')
bus: Power distribution for missing models
0 14
122 5
150 4
70 2
129 1
130 1
200 1
85 1
90 1
156 1
95 1
100 1
109 1
110 1
116 1
55 1
Name: power, dtype: int64
Int64Index([0, 122, 150, 70, 129, 130, 200, 85, 90, 156, 95, 100, 109, 110,
116, 55],
dtype='int64')
bus: Registration year distribution for missing models
2001.0 7
2002.0 5
2006.0 4
2008.0 3
2007.0 3
2000.0 3
1994.0 2
2005.0 2
2004.0 2
2009.0 2
2003.0 2
1998.0 1
1999.0 1
Name: registrationyear, dtype: int64
Float64Index([2001.0, 2002.0, 2006.0, 2008.0, 2007.0, 2000.0, 1994.0, 2005.0,
2004.0, 2009.0, 2003.0, 1998.0, 1999.0],
dtype='float64')
convertible: Power distribution for missing models
0 6
163 4
326 3
136 2
170 2
193 1
231 1
198 1
240 1
218 1
220 1
168 1
Name: power, dtype: int64
Int64Index([0, 163, 326, 136, 170, 193, 231, 198, 240, 218, 220, 168], dtype='int64')
convertible: Registration year distribution for missing models
2004.0 5
2001.0 3
1992.0 3
2000.0 3
2007.0 2
2002.0 2
1984.0 1
1993.0 1
1968.0 1
1960.0 1
1998.0 1
2005.0 1
Name: registrationyear, dtype: int64
Float64Index([2004.0, 2001.0, 1992.0, 2000.0, 2007.0, 2002.0, 1984.0, 1993.0,
1968.0, 1960.0, 1998.0, 2005.0],
dtype='float64')
other: Power distribution for missing models
0 8
75 2
129 1
99 1
116 1
72 1
90 1
79 1
Name: power, dtype: int64
Int64Index([0, 75, 129, 99, 116, 72, 90, 79], dtype='int64')
other: Registration year distribution for missing models
2001.0 2
2006.0 1
1999.0 1
2016.0 1
1992.0 1
2013.0 1
1981.0 1
2007.0 1
1971.0 1
1997.0 1
1988.0 1
1983.0 1
2000.0 1
1993.0 1
1991.0 1
Name: registrationyear, dtype: int64
Float64Index([2001.0, 2006.0, 1999.0, 2016.0, 1992.0, 2013.0, 1981.0, 2007.0,
1971.0, 1997.0, 1988.0, 1983.0, 2000.0, 1993.0, 1991.0],
dtype='float64')
suv: Power distribution for missing models
0 3
190 3
163 2
165 2
224 2
150 1
167 1
250 1
Name: power, dtype: int64
Int64Index([0, 190, 163, 165, 224, 150, 167, 250], dtype='int64')
suv: Registration year distribution for missing models
2007.0 4
2001.0 2
2000.0 2
2008.0 1
1998.0 1
1989.0 1
2002.0 1
2003.0 1
2005.0 1
2006.0 1
Name: registrationyear, dtype: int64
Float64Index([2007.0, 2001.0, 2000.0, 2008.0, 1998.0, 1989.0, 2002.0, 2003.0,
2005.0, 2006.0],
dtype='float64')
small: Power distribution for missing models
0 6
75 2
108 2
74 1
90 1
125 1
62 1
Name: power, dtype: int64
Int64Index([0, 75, 108, 74, 90, 125, 62], dtype='int64')
small: Registration year distribution for missing models
2000.0 5
1998.0 2
2004.0 2
2006.0 2
2008.0 1
1990.0 1
2002.0 1
Name: registrationyear, dtype: int64
Float64Index([2000.0, 1998.0, 2004.0, 2006.0, 2008.0, 1990.0, 2002.0], dtype='float64')
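The per-type printouts above are long, and a compact per-vehicle-type summary can make the same pattern (many rows with `power == 0`, years clustered around 2000) easier to scan. The following is a minimal sketch, assuming the same column names as `df_new`; `summarize_missing_models` and the toy frame are hypothetical and not part of the original notebook.

```python
import pandas as pd

def summarize_missing_models(df, brand):
    """Per vehicle type: rows with a missing model, share of power == 0, median year."""
    rows = df[(df["brand"] == brand) & df["model"].isna()]
    return (
        rows.groupby("vehicletype")
        .agg(
            n_missing=("power", "size"),
            zero_power_share=("power", lambda s: (s == 0).mean()),
            median_year=("registrationyear", "median"),
        )
        .sort_values("n_missing", ascending=False)
    )

# Toy data (hypothetical), standing in for df_new
toy = pd.DataFrame({
    "brand": ["ford", "ford", "ford", "ford", "ford", "audi"],
    "model": [None, None, None, None, "focus", None],
    "vehicletype": ["small", "small", "small", "wagon", "small", "sedan"],
    "power": [0, 0, 60, 115, 75, 150],
    "registrationyear": [1999, 1998, 2000, 2001, 2000, 2010],
})
summary = summarize_missing_models(toy, "ford")
print(summary)
```

On the real data the call would be `summarize_missing_models(df_new, 'ford')`, which returns one row per vehicle type instead of several screens of counts.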
df_new[(df_new['brand'] == 'mercedes_benz') & (df_new['vehicletype'].isin(['sedan'])) & (df_new['power'].isin([102,82,88])) & \
(df_new['model'].isna())].value_counts(subset = 'registrationyear').index
Float64Index([2000.0, 1989.0, 1999.0], dtype='float64', name='registrationyear')
df_new[(df_new['brand'] == 'mercedes_benz') & (df_new['vehicletype'].isin(['sedan'])) & (df_new['registrationyear'].isin([2000])) & (df_new['model'].isna())].value_counts(subset = 'power').index
Int64Index([0, 163, 143, 170, 88, 102, 116, 160, 197, 265, 306], dtype='int64', name='power')
df_new[(df_new['brand'] == 'mercedes_benz') & (df_new['vehicletype'].isin(['sedan'])) & (df_new['power'].isin([102,82,88])) & \
(df_new['registrationyear'].isin([2000])) & (df_new['model'].notna())].value_counts(subset = 'model')
model
a_klasse    138
c_klasse      6
e_klasse      1
dtype: int64
c = (df_new['brand'] == 'mercedes_benz') & (df_new['vehicletype'].isin(['sedan'])) & (df_new['power'].isin([122])) & (df_new['registrationyear'].isin([1996.0, 1994.0, 1995.0, 1997.0, 1998.0, 1999.0])) & (df_new['model'].isna())
df_new.loc[c,['model']] = 'c_klasse'
e = (df_new['brand'] == 'mercedes_benz') & (df_new['vehicletype'].isin(['sedan'])) & (df_new['power'].isin([150, 177,306])) & (df_new['registrationyear'].isin([2002])) & (df_new['model'].isna())
df_new.loc[e,['model']] = 'e_klasse'
a = (df_new['brand'] == 'mercedes_benz') & (df_new['vehicletype'].isin(['sedan'])) & (df_new['power'].isin([102,82,88])) & (df_new['registrationyear'].isin([2000])) & (df_new['model'].isna())
df_new.loc[a,['model']] = 'a_klasse'
def fill_missing_models_majority(df, threshold=0.9):
    df = df.copy()
    keys = ['brand', 'vehicletype', 'power', 'registrationyear']
    # Split known vs missing
    known = df[df['model'].notna()]
    missing = df[df['model'].isna()]
    # Step 1: compute each model's share within its key combination.
    # transform('sum') keeps the original index, so the division aligns row by row
    # regardless of how groupby.apply treats group keys in the installed pandas.
    counts = known.groupby(keys + ['model']).size()
    model_stats = (
        (counts / counts.groupby(level=keys).transform('sum'))
        .reset_index(name='model_share')
    )
    # Step 2: keep only those combos where a single model dominates (≥ threshold)
    dominant = (
        model_stats[model_stats['model_share'] >= threshold]
        .sort_values('model_share', ascending=False)
        .drop_duplicates(subset=keys)
    )
    # Step 3: merge and fill
    filled = missing.merge(
        dominant[keys + ['model']],
        on=keys,
        how='left',
        suffixes=('', '_pred')
    )
    filled['model'] = filled['model_pred'].combine_first(filled['model'])
    filled = filled.drop(columns=['model_pred'])
    # Step 4: combine back with known (note: this resets the index and the row order)
    result = pd.concat([known, filled], ignore_index=True)
    return result
df_new.isna().sum()
datecrawled                    0
price                          0
vehicletype                37471
registrationyear               0
gearbox                    19830
power                          0
model                      15644
mileage                        0
registrationmonth              0
fueltype                   32889
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
dtype: int64
df_newer = fill_missing_models_majority(df_new, threshold = 0.9)
df_newer.isna().sum()
datecrawled                    0
price                          0
vehicletype                37471
registrationyear               0
gearbox                    19830
power                          0
model                      14878
mileage                        0
registrationmonth              0
fueltype                   32889
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
dtype: int64
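The fill functions in this section all follow the same pattern: compute the share of each value within a group, keep groups where one value reaches the threshold, and fill only the missing rows. The sketch below is one way to express that pattern once, under the assumption that keeping the original index and row order is desirable; `fill_by_group_majority` and the toy frame are hypothetical and are not the functions used in this notebook.

```python
import pandas as pd

def fill_by_group_majority(df, target, group_cols, threshold=0.9):
    """Fill NaNs in `target` with the group's dominant value when its share >= threshold."""
    df = df.copy()
    known = df[df[target].notna()]
    # Share of each target value per group (groupby drops rows with NaN keys).
    shares = (
        known.groupby(group_cols)[target]
        .value_counts(normalize=True)
        .rename("share")
        .reset_index()
    )
    dominant = (
        shares[shares["share"] >= threshold]
        .sort_values("share", ascending=False)
        .drop_duplicates(subset=group_cols)
        .rename(columns={target: "_pred"})
    )
    # A left merge on unique keys returns one row per input row, in the same order.
    pred = df[group_cols].merge(
        dominant[group_cols + ["_pred"]], on=group_cols, how="left"
    )["_pred"]
    pred.index = df.index
    df[target] = df[target].fillna(pred)
    return df

# Toy data (hypothetical)
toy = pd.DataFrame({
    "brand": ["vw"] * 5 + ["ford"] * 3,
    "vehicletype": ["sedan"] * 5 + ["small"] * 3,
    "model": ["golf", "golf", "golf", "golf", None, "ka", "fiesta", None],
})
filled = fill_by_group_majority(toy, "model", ["brand", "vehicletype"], threshold=0.9)
print(filled)
```

In the toy frame the `vw`/`sedan` gap is filled with `golf` (share 1.0), while the `ford`/`small` gap stays missing because neither model reaches 90%.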
def fill_missing_vehicletype(df, threshold=0.9):
    df = df.copy()
    # Split known vs missing
    known = df[df['vehicletype'].notna()]
    missing = df[df['vehicletype'].isna()]
    if missing.empty:
        return df  # nothing to fill
    # Step 1: compute dominant vehicletype per group
    vehicletype_stats = (
        known.groupby(['brand', 'model', 'power', 'registrationyear'])['vehicletype']
        .value_counts(normalize=True)  # fraction per type
        .rename('fraction')
        .reset_index()
    )
    # Keep only dominant types above threshold
    dominant_types = (
        vehicletype_stats[vehicletype_stats['fraction'] >= threshold]
        .sort_values('fraction', ascending=False)
        .drop_duplicates(subset=['brand', 'model', 'power', 'registrationyear'])
    )
    # Step 2: merge dominant types into missing
    missing_filled = missing.merge(
        dominant_types[['brand', 'model', 'power', 'registrationyear', 'vehicletype']],
        on=['brand', 'model', 'power', 'registrationyear'],
        how='left',
        suffixes=('', '_pred')
    )
    # Fill in vehicletype from dominant type
    missing_filled['vehicletype'] = missing_filled['vehicletype_pred'].combine_first(missing_filled['vehicletype'])
    missing_filled = missing_filled.drop(columns=['vehicletype_pred'])
    # Step 3: combine back with known
    result = pd.concat([known, missing_filled], ignore_index=True)
    return result
df_newest = fill_missing_vehicletype(df_newer)
df_newest.isna().sum()
datecrawled                    0
price                          0
vehicletype                33013
registrationyear               0
gearbox                    19830
power                          0
model                      14878
mileage                        0
registrationmonth              0
fueltype                   32889
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
dtype: int64
def fill_zero_power(df, group_cols=None, threshold=0.9):
    """
    Fill zero horsepower (HP) values using mode-based imputation with a confidence threshold.
    group_cols : list of str, optional
        Columns to group by when determining mode HP.
        Default: ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear']
    threshold : float, optional
        Minimum share of the mode within its group for the fill to be applied.
    returns df :
        DataFrame with zero HP values filled where confident mode exists.
    """
    # Default grouping columns
    if group_cols is None:
        group_cols = ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear']
    df = df.copy()  # Work on a copy to avoid side effects
    # Step 1: Compute mode HP for each group
    hp_mode_stats = (
        df[df['power'] > 0]  # Only consider valid HPs
        .groupby(group_cols)['power']
        .agg(lambda x: x.mode()[0] if not x.mode().empty else None)
        .reset_index(name='mode_hp')
    )
    # Step 2: Compute mode frequency (confidence)
    hp_freq_stats = (
        df[df['power'] > 0]
        .groupby(group_cols)['power']
        .value_counts(normalize=True)
        .groupby(level=list(range(len(group_cols))))  # Group again by same keys
        .max()
        .reset_index(name='mode_freq')
    )
    # Step 3: Keep only groups where mode occurs ≥ threshold fraction of the time
    hp_stats = pd.merge(hp_mode_stats, hp_freq_stats, on=group_cols)
    hp_stats = hp_stats[hp_stats['mode_freq'] >= threshold]
    # Step 4: Merge imputation info back to df
    df = df.merge(hp_stats, on=group_cols, how='left')
    # Step 5: Fill zeros only where confident mode exists (vectorised instead of row-wise apply)
    fill_mask = (df['power'] == 0) & df['mode_hp'].notna()
    df['power'] = df['power'].mask(fill_mask, df['mode_hp'])
    # Step 6: Clean up helper columns
    df = df.drop(columns=['mode_hp', 'mode_freq'], errors='ignore')
    return df
df_car = fill_zero_power(df_newest)
df_car.isna().sum()
datecrawled                    0
price                          0
vehicletype                33013
registrationyear               0
gearbox                    19830
power                          0
model                      14878
mileage                        0
registrationmonth              0
fueltype                   32889
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
dtype: int64
def fill_missing_fueltype(df, group_cols=None, threshold=0.9):
    if group_cols is None:
        group_cols = ['brand', 'model', 'power', 'vehicletype', 'registrationyear']
    df = df.copy()
    # Step 1: Compute mode fueltype per group
    fuel_mode_stats = (
        df[df['fueltype'].notna()]
        .groupby(group_cols)['fueltype']
        .agg(lambda x: x.mode()[0] if not x.mode().empty else None)
        .reset_index(name='mode_fueltype')
    )
    # Step 2: Compute how dominant (confident) that mode is
    fuel_freq_stats = (
        df[df['fueltype'].notna()]
        .groupby(group_cols)['fueltype']
        .value_counts(normalize=True)
        .groupby(level=list(range(len(group_cols))))
        .max()
        .reset_index(name='mode_freq')
    )
    # Step 3: Keep only groups with strong mode agreement
    fuel_stats = pd.merge(fuel_mode_stats, fuel_freq_stats, on=group_cols)
    fuel_stats = fuel_stats[fuel_stats['mode_freq'] >= threshold]
    # Step 4: Merge back and fill missing (fillna leaves rows without a confident mode untouched)
    df = df.merge(fuel_stats, on=group_cols, how='left')
    df['fueltype'] = df['fueltype'].fillna(df['mode_fueltype'])
    # Step 5: Clean up helper columns
    df = df.drop(columns=['mode_fueltype', 'mode_freq'], errors='ignore')
    return df
df_ft = fill_missing_fueltype(df_car)
df_ft.isna().sum()
datecrawled                    0
price                          0
vehicletype                33013
registrationyear               0
gearbox                    19830
power                          0
model                      14878
mileage                        0
registrationmonth              0
fueltype                   21712
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
dtype: int64
def fill_missing_models_majority_new(df, threshold=0.9):
    df = df.copy()
    # --- Step 1: Bin registration year into custom ranges ---
    def categorize_year(year):
        if pd.isna(year):
            return np.nan
        elif year < 1990:
            return 'before_1990'
        elif year < 2000:
            return '1990s'
        elif year < 2010:
            return '2000s'
        else:
            return '2010_plus'
    df['year_bin'] = df['registrationyear'].apply(categorize_year)
    # --- Step 2: Split known and missing models ---
    known = df[df['model'].notna()]
    missing = df[df['model'].isna()]
    # --- Step 3: Compute majority model per group ---
    model_counts = (
        known.groupby(['brand', 'vehicletype', 'year_bin'])['model']
        .value_counts(normalize=True)
        .rename('freq')
        .reset_index()
    )
    # Keep only models that dominate a group above the threshold (e.g., 90%)
    majority_models = model_counts[model_counts['freq'] >= threshold]
    # --- Step 4: Merge and fill missing models ---
    filled = missing.merge(
        majority_models[['brand', 'vehicletype', 'year_bin', 'model']],
        on=['brand', 'vehicletype', 'year_bin'],
        how='left',
        suffixes=('', '_majority')
    )
    # Fill missing model where a confident majority exists
    filled['model'] = filled['model_majority'].combine_first(filled['model'])
    filled.drop(columns=['model_majority'], inplace=True)
    # --- Step 5: Combine back ---
    result = pd.concat([known, filled], ignore_index=True)
    return result
df_model = fill_missing_models_majority_new(df_ft)
df_model.isna().sum()
datecrawled                    0
price                          0
vehicletype                33013
registrationyear               0
gearbox                    19830
power                          0
model                      14455
mileage                        0
registrationmonth              0
fueltype                   21712
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
year_bin                       0
dtype: int64
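The `year_bin` column is built with a row-wise `apply` of `categorize_year`. As a possible alternative, the same half-open ranges can be expressed with `pd.cut`. This is a sketch on a hypothetical toy series, not the code the notebook ran; the result is converted to plain strings on the assumption that later `groupby` calls should not see a categorical key.

```python
import numpy as np
import pandas as pd

# Toy registration years (hypothetical), including both edges of each range
years = pd.Series([1985, 1990, 1999, 2000, 2009, 2010, np.nan])

year_bin = pd.cut(
    years,
    bins=[-np.inf, 1990, 2000, 2010, np.inf],
    labels=["before_1990", "1990s", "2000s", "2010_plus"],
    right=False,  # intervals are [low, high), matching the `year < ...` checks
).astype(object)  # plain strings and NaN rather than a categorical column

print(year_bin.tolist())
```

With `right=False`, 1990 falls in `1990s` and 2010 in `2010_plus`, which matches the strict `<` comparisons in `categorize_year`; missing years stay missing.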
def fill_missing_vehicle_type_majority(df, threshold=0.9):
    df = df.copy()
    # --- Step 1: Split into known and missing ---
    known = df[df['vehicletype'].notna()]
    missing = df[df['vehicletype'].isna()]
    # --- Step 2: Compute majority vehicletype per group ---
    vt_counts = (
        known.groupby(['brand', 'model', 'year_bin'])['vehicletype']
        .value_counts(normalize=True)
        .rename('freq')
        .reset_index()
    )
    majority_types = vt_counts[vt_counts['freq'] >= threshold]
    # --- Step 3: Merge and fill ---
    filled = missing.merge(
        majority_types[['brand', 'model', 'year_bin', 'vehicletype']],
        on=['brand', 'model', 'year_bin'],
        how='left',
        suffixes=('', '_majority')
    )
    filled['vehicletype'] = filled['vehicletype_majority'].combine_first(filled['vehicletype'])
    filled.drop(columns=['vehicletype_majority'], inplace=True)
    # --- Step 4: Combine back ---
    result = pd.concat([known, filled], ignore_index=True)
    return result
df_vt = fill_missing_vehicle_type_majority(df_model)
df_vt.isna().sum()
datecrawled                    0
price                          0
vehicletype                24944
registrationyear               0
gearbox                    19830
power                          0
model                      14455
mileage                        0
registrationmonth              0
fueltype                   21712
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
year_bin                       0
dtype: int64
def fill_missing_models_majority_x(df, threshold=0.9):
    df = df.copy()
    missing_before = df['model'].isna().sum()
    # Define tiered grouping strategies from broad → narrow
    groupings = [
        ['brand', 'vehicletype'],
        ['brand', 'vehicletype', 'year_bin'],
        ['brand', 'fueltype', 'vehicletype'],
        ['brand', 'vehicletype', 'fueltype', 'year_bin']
    ]
    # Iterate through groupings
    for cols in groupings:
        majority_model = (
            df.groupby(cols)['model']
            .agg(lambda x: x.mode().iloc[0] if len(x.mode()) > 0 else np.nan)
        )
        counts = df.groupby(cols)['model'].value_counts(normalize=True).groupby(cols).max()
        # Groups without any known model get confidence 0, so they never pass the threshold
        majority_model = majority_model[counts.reindex(majority_model.index).fillna(0) >= threshold]
        df['model'] = df.apply(
            lambda row: majority_model.get(tuple(row[c] for c in cols), row['model'])
            if pd.isna(row['model'])
            else row['model'],
            axis=1
        )
    missing_after = df['model'].isna().sum()
    filled = missing_before - missing_after
    print(f"✅ Filled {filled} missing models (threshold={threshold:.0%})")
    return df
df_model_x = fill_missing_models_majority_x(df_vt)
✅ Filled 112 missing models (threshold=90%)
df_model_x.isna().sum()
datecrawled                    0
price                          0
vehicletype                24944
registrationyear               0
gearbox                    19830
power                          0
model                      14343
mileage                        0
registrationmonth              0
fueltype                   21712
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
year_bin                       0
dtype: int64
def fill_missing_vehicle_type(df, threshold=0.9):
    """
    Fills missing vehicle types based on the most common value within:
    1. (model, brand, year_bin, power)
    2. (model, brand, year_bin)
    3. (model, brand)
    Only fills when the confidence (frequency ratio of the mode)
    is at or above the given threshold.
    """
    df = df.copy()

    def safe_mode(series):
        m = series.mode(dropna=True)
        return m.iloc[0] if not m.empty else np.nan

    def mode_confidence(series):
        # Share of the most frequent non-missing value; 0 when the group has no known value
        return series.value_counts(normalize=True).iloc[0] if not series.dropna().empty else 0

    # Grouping levels from most to least detailed, each with its helper column name
    levels = [
        (['model', 'brand', 'year_bin', 'power'], 'majority_type'),
        (['model', 'brand', 'year_bin'], 'fallback_type'),
        (['model', 'brand'], 'bm_type'),
    ]
    # STEPS 1-3: compute the confident majority type per level and attach it as a helper column.
    # All levels are computed before any fill, so each one only sees the originally known types.
    for group_cols, helper in levels:
        grouped = df.groupby(group_cols)['vehicletype']
        majority = grouped.apply(safe_mode)
        confidence = grouped.apply(mode_confidence)
        majority = majority[confidence >= threshold].rename(helper).reset_index()
        df = df.merge(majority, on=group_cols, how='left')
    # STEP 4: Fill missing progressively, most detailed level first
    for _, helper in levels:
        df['vehicletype'] = df['vehicletype'].fillna(df[helper])
    # STEP 5: Cleanup
    df.drop(columns=[helper for _, helper in levels], inplace=True)
    return df
df_vetype = fill_missing_vehicle_type(df_model_x)
df_vetype.isna().sum()
datecrawled                    0
price                          0
vehicletype                21015
registrationyear               0
gearbox                    19830
power                          0
model                      14343
mileage                        0
registrationmonth              0
fueltype                   21712
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
year_bin                       0
dtype: int64
def fill_all_missing_values(
    df,
    threshold=0.9,
    verbose=True,
    repeat_until_no_change=True,
    max_loops=5
):
    """
    Runs all fill functions in sequence (and optionally repeats)
    until no more missing values are filled.

    Parameters
    ----------
    df : pd.DataFrame
        The input dataframe.
    threshold : float, optional (default=0.9)
        Confidence threshold for majority-based fills.
    verbose : bool, optional (default=True)
        Print progress updates.
    repeat_until_no_change : bool, optional (default=True)
        If True, keeps looping until no new values are filled.
    max_loops : int, optional (default=5)
        Safety limit for maximum number of full passes.

    Returns
    -------
    df : pd.DataFrame
        The filled dataframe.
    """
    df = df.copy()
    steps = [
        ("Vehicle Type", fill_missing_vehicle_type),
        ("Model", fill_missing_models_majority_x),
        ("Fuel Type", fill_missing_fueltype),
        ("Power (0 HP)", fill_zero_power)
    ]

    def count_missing(d):
        return (
            d['vehicletype'].isna().sum(),
            d['model'].isna().sum(),
            d['fueltype'].isna().sum(),
            (d['power'] == 0).sum()
        )

    last_missing = count_missing(df)
    loop = 0
    while True:
        loop += 1
        if verbose:
            print(f"\n🔁 Pass {loop} (threshold={threshold:.0%})")
        for name, func in steps:
            if verbose:
                print(f" ▶ Running {name} fill function...")
            try:
                df = func(df, threshold=threshold)
            except TypeError:
                df = func(df)
            except Exception as e:
                print(f" ⚠️ Error in {name}: {e}")
        current_missing = count_missing(df)
        if verbose:
            print(f" Missing counts after pass {loop}:")
            print(f" vehicletype: {current_missing[0]:,}")
            print(f" model: {current_missing[1]:,}")
            print(f" fueltype: {current_missing[2]:,}")
            print(f" power==0: {current_missing[3]:,}")
        # Stop if no more changes
        if not repeat_until_no_change:
            break
        if current_missing == last_missing:
            if verbose:
                print("\n✅ No further fills detected — stopping.")
            break
        if loop >= max_loops:
            if verbose:
                print("\n⚠️ Reached max loop limit, stopping.")
            break
        last_missing = current_missing
    if verbose:
        print("\n🏁 All fill functions completed.\n")
    return df
df_car = fill_all_missing_values(df_vetype, threshold=0.7, repeat_until_no_change=True)
🔁 Pass 1 (threshold=70%)
▶ Running Vehicle Type fill function...
▶ Running Model fill function...
✅ Filled 878 missing models (threshold=70%)
▶ Running Fuel Type fill function...
▶ Running Power (0 HP) fill function...
Missing counts after pass 1:
vehicletype: 16,445
model: 13,465
fueltype: 15,722
power==0: 35,309
🔁 Pass 2 (threshold=70%)
▶ Running Vehicle Type fill function...
▶ Running Model fill function...
✅ Filled 17 missing models (threshold=70%)
▶ Running Fuel Type fill function...
▶ Running Power (0 HP) fill function...
Missing counts after pass 2:
vehicletype: 15,387
model: 13,448
fueltype: 15,260
power==0: 35,289
🔁 Pass 3 (threshold=70%)
▶ Running Vehicle Type fill function...
▶ Running Model fill function...
✅ Filled 0 missing models (threshold=70%)
▶ Running Fuel Type fill function...
▶ Running Power (0 HP) fill function...
Missing counts after pass 3:
vehicletype: 15,387
model: 13,448
fueltype: 15,251
power==0: 35,289
🔁 Pass 4 (threshold=70%)
▶ Running Vehicle Type fill function...
▶ Running Model fill function...
✅ Filled 0 missing models (threshold=70%)
▶ Running Fuel Type fill function...
▶ Running Power (0 HP) fill function...
Missing counts after pass 4:
vehicletype: 15,387
model: 13,448
fueltype: 15,251
power==0: 35,289
✅ No further fills detected — stopping.
🏁 All fill functions completed.
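How much each pass contributed can be read off by differencing the missing counts printed above. The short worked example below uses those printed numbers only; `history` and `filled_per_pass` are illustrative names, and pass 1 is left out of the differences because the `power == 0` count before the first pass is not shown in the log.

```python
import pandas as pd

# Missing counts after each pass, copied from the log above
history = pd.DataFrame(
    {
        "vehicletype": [16445, 15387, 15387, 15387],
        "model": [13465, 13448, 13448, 13448],
        "fueltype": [15722, 15260, 15251, 15251],
        "power_zero": [35309, 35289, 35289, 35289],
    },
    index=pd.Index([1, 2, 3, 4], name="pass"),
)

# Values filled during passes 2 to 4 (drop in the missing count versus the previous pass)
filled_per_pass = -history.diff().dropna().astype(int)
print(filled_per_pass)
```

Pass 2 still fills 1,058 vehicle types, 17 models, 462 fuel types and 20 zero-power rows, pass 3 fills only 9 fuel types, and pass 4 changes nothing, which is what triggers the stop condition.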
def correct_registration_years(df, threshold=0.9, proximity=1):
    """
    Corrects registration years flagged as 'too early' or 'too late'.
    Adds ±proximity tolerance when determining majority years.
    """
    df = df.copy()
    # --- Split flagged vs correct ---
    flagged_mask = df['registration_correction'].isin(['Y: too early', 'Y: too late'])
    flagged = df[flagged_mask].copy()
    correct = df[~flagged_mask].copy()
    if flagged.empty:
        return df  # nothing to fix

    # --- Helper: Cluster nearby years (±proximity) ---
    def cluster_years(series, proximity=1):
        if series.empty:
            return np.nan, 0
        years = series.dropna().astype(int)
        if years.empty:
            return np.nan, 0
        clusters = []
        for y in sorted(years.unique()):
            found = False
            for cluster in clusters:
                # A year joins a cluster when it is within `proximity` of the cluster's last year
                if abs(cluster['years'][-1] - y) <= proximity:
                    cluster['years'].append(y)
                    cluster['count'] += (years == y).sum()
                    found = True
                    break
            if not found:
                clusters.append({'years': [y], 'count': (years == y).sum()})
        top_cluster = max(clusters, key=lambda c: c['count'])
        cluster_year = int(np.round(np.mean(top_cluster['years'])))
        freq = top_cluster['count'] / len(years)
        return cluster_year, freq

    # --- Compute majority year per group ---
    def get_majority_table(group_cols):
        rows = []
        for name, group in correct.groupby(group_cols):
            year, freq = cluster_years(group['registrationyear'], proximity)
            rows.append((*name, year, freq, group['registrationyear'].min(), group['registrationyear'].max()))
        return pd.DataFrame(rows, columns=group_cols + ['majority_year', 'mode_freq', 'min', 'max'])

    # Start with detailed grouping
    majority_df = get_majority_table(['brand', 'model', 'year_bin', 'power', 'vehicletype'])
    flagged = flagged.merge(majority_df, on=['brand', 'model', 'year_bin', 'power', 'vehicletype'], how='left')
    # --- Fallback using (brand, model, year_bin, vehicletype) ---
    missing_mask = flagged['majority_year'].isna()
    if missing_mask.any():
        fallback = get_majority_table(['brand', 'model', 'year_bin', 'vehicletype'])
        flagged = flagged.merge(
            fallback,
            on=['brand', 'model', 'year_bin', 'vehicletype'],
            how='left',
            suffixes=('', '_fallback')
        )
        # Fill in missing majority fields from fallback where possible
        for col in ['majority_year', 'mode_freq', 'min', 'max']:
            flagged[col] = flagged[col].fillna(flagged[f"{col}_fallback"])
        # Clean up helper columns
        flagged.drop(columns=[c for c in flagged.columns if c.endswith('_fallback')], inplace=True)

    # --- Apply corrections ---
    def fill_year(row):
        if pd.notna(row['mode_freq']) and row['mode_freq'] >= threshold and pd.notna(row['majority_year']):
            return row['majority_year']
        elif row['registration_correction'] == 'Y: too early' and pd.notna(row['min']):
            return row['min']
        elif row['registration_correction'] == 'Y: too late' and pd.notna(row['max']):
            return row['max']
        else:
            return row['registrationyear']

    def fill_flag(row):
        # The flag is cleared only for confident corrections; min/max fallbacks stay flagged
        if pd.notna(row['mode_freq']) and row['mode_freq'] >= threshold:
            return 'N'
        else:
            return row['registration_correction']

    flagged['registrationyear'] = flagged.apply(fill_year, axis=1)
    flagged['registration_correction'] = flagged.apply(fill_flag, axis=1)
    # --- Cleanup helper cols ---
    flagged.drop(columns=['majority_year', 'mode_freq', 'min', 'max'], inplace=True)
    # --- Combine back safely ---
    result = pd.concat([correct, flagged], ignore_index=True)
    return result
df_reg = correct_registration_years(df_car, threshold = 0.7, proximity = 5)
df_reg[(df_reg['registration_correction'] == "Y: too early") | (df_reg['registration_correction'] == "Y: too late")]
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 336004 | 11/03/2016 21:39 | 450 | small | 1910.0 | NaN | 0.0 | ka | 5000 | 0 | petrol | ford | NaN | 2016-11-03 | 0 | 24148 | 19/03/2016 08:46 | Y: too early | before_1990 |
| 336011 | 22/03/2016 14:55 | 3299 | sedan | 1989.0 | auto | 132.0 | e_klasse | 150000 | 6 | petrol | mercedes_benz | no | 2016-03-22 | 0 | 63801 | 06/04/2016 05:15 | Y: too early | before_1990 |
| 336013 | 16/03/2016 13:45 | 140 | small | 1986.0 | NaN | 0.0 | cayenne | 20000 | 0 | petrol | porsche | NaN | 2016-03-16 | 0 | 25860 | 17/03/2016 11:17 | Y: too early | before_1990 |
| 336014 | 29/03/2016 07:58 | 4300 | coupe | 1990.0 | manual | 170.0 | 90 | 150000 | 4 | petrol | audi | NaN | 2016-03-29 | 0 | 13595 | 05/04/2016 18:17 | Y: too late | 1990s |
| 336015 | 28/03/2016 09:53 | 6990 | wagon | 1983.0 | manual | 72.0 | e_klasse | 150000 | 6 | gasoline | mercedes_benz | no | 2016-03-28 | 0 | 31737 | 06/04/2016 11:16 | Y: too early | before_1990 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 354102 | 16/03/2016 17:51 | 2300 | NaN | 2017.0 | auto | 192.0 | NaN | 150000 | 0 | NaN | bmw | no | 2016-03-16 | 0 | 45896 | 17/03/2016 16:17 | Y: too late | 2010_plus |
| 354103 | 24/03/2016 16:54 | 900 | NaN | 2017.0 | manual | 101.0 | NaN | 150000 | 6 | NaN | opel | NaN | 2016-03-24 | 0 | 50170 | 07/04/2016 09:17 | Y: too late | 2010_plus |
| 354104 | 07/04/2016 08:36 | 1670 | NaN | 2018.0 | manual | 0.0 | NaN | 90000 | 0 | petrol | renault | no | 2016-07-04 | 0 | 12167 | 07/04/2016 08:36 | Y: too late | 2010_plus |
| 354105 | 04/04/2016 21:40 | 10980 | NaN | 2018.0 | manual | 75.0 | NaN | 20000 | 1 | NaN | volkswagen | no | 2016-04-04 | 0 | 44801 | 07/04/2016 00:15 | Y: too late | 2010_plus |
| 354106 | 01/04/2016 02:36 | 1000 | NaN | 2017.0 | manual | 54.0 | NaN | 125000 | 2 | NaN | hyundai | no | 2016-01-04 | 0 | 67547 | 05/04/2016 02:45 | Y: too late | 2010_plus |
7437 rows × 18 columns
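To see what the `proximity` argument does inside `cluster_years`, the helper can be run on its own against a toy series. The block below is a minimal standalone sketch intended to behave like the nested helper (it chains sorted years whose gap is at most `proximity`); the `toy` data is made up for illustration and is not taken from the dataset.

```python
import numpy as np
import pandas as pd

def cluster_years(series, proximity=1):
    """Chain sorted years whose gap is <= proximity; return (rounded mean year, share)."""
    years = series.dropna().astype(int)
    if years.empty:
        return np.nan, 0
    counts = years.value_counts()
    clusters = []
    for y in sorted(counts.index):
        # Years are visited in ascending order, so only the newest cluster can be in reach.
        if clusters and y - clusters[-1]['years'][-1] <= proximity:
            clusters[-1]['years'].append(y)
            clusters[-1]['count'] += counts.loc[y]
        else:
            clusters.append({'years': [y], 'count': counts.loc[y]})
    top = max(clusters, key=lambda c: c['count'])
    # The cluster year is the unweighted mean of the distinct years in the cluster.
    return int(np.round(np.mean(top['years']))), top['count'] / len(years)

toy = pd.Series([1998, 1999, 1999, 2000, 2010, np.nan])  # made-up years

tight = cluster_years(toy, proximity=0)   # every distinct year is its own cluster
loose = cluster_years(toy, proximity=1)   # 1998-2000 chain together
wide = cluster_years(toy, proximity=10)   # 2010 is chained in as well
print(tight, loose, wide)
```

Because clusters are chained, a large `proximity` can merge most of a group's years into one cluster. Its share then approaches 1 and passes almost any `threshold`, while the returned year drifts toward the mean of the distinct years rather than the most common one. That may be worth keeping in mind when reading the `proximity = 5` and `proximity = 10` runs.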
df_reg_years = correct_registration_years(df_reg, threshold = 0.7, proximity = 10)
df_reg_years[(df_reg_years['registration_correction'] == "Y: too early") | (df_reg_years['registration_correction'] == "Y: too late")]
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 346670 | 11/03/2016 21:39 | 450 | small | 1910.0 | NaN | 0.0 | ka | 5000 | 0 | petrol | ford | NaN | 2016-11-03 | 0 | 24148 | 19/03/2016 08:46 | Y: too early | before_1990 |
| 346671 | 22/03/2016 14:55 | 3299 | sedan | 1989.0 | auto | 132.0 | e_klasse | 150000 | 6 | petrol | mercedes_benz | no | 2016-03-22 | 0 | 63801 | 06/04/2016 05:15 | Y: too early | before_1990 |
| 346672 | 16/03/2016 13:45 | 140 | small | 1986.0 | NaN | 0.0 | cayenne | 20000 | 0 | petrol | porsche | NaN | 2016-03-16 | 0 | 25860 | 17/03/2016 11:17 | Y: too early | before_1990 |
| 346673 | 29/03/2016 07:58 | 4300 | coupe | 1990.0 | manual | 170.0 | 90 | 150000 | 4 | petrol | audi | NaN | 2016-03-29 | 0 | 13595 | 05/04/2016 18:17 | Y: too late | 1990s |
| 346674 | 28/03/2016 09:53 | 6990 | wagon | 1983.0 | manual | 72.0 | e_klasse | 150000 | 6 | gasoline | mercedes_benz | no | 2016-03-28 | 0 | 31737 | 06/04/2016 11:16 | Y: too early | before_1990 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 354102 | 16/03/2016 17:51 | 2300 | NaN | 2017.0 | auto | 192.0 | NaN | 150000 | 0 | NaN | bmw | no | 2016-03-16 | 0 | 45896 | 17/03/2016 16:17 | Y: too late | 2010_plus |
| 354103 | 24/03/2016 16:54 | 900 | NaN | 2017.0 | manual | 101.0 | NaN | 150000 | 6 | NaN | opel | NaN | 2016-03-24 | 0 | 50170 | 07/04/2016 09:17 | Y: too late | 2010_plus |
| 354104 | 07/04/2016 08:36 | 1670 | NaN | 2018.0 | manual | 0.0 | NaN | 90000 | 0 | petrol | renault | no | 2016-07-04 | 0 | 12167 | 07/04/2016 08:36 | Y: too late | 2010_plus |
| 354105 | 04/04/2016 21:40 | 10980 | NaN | 2018.0 | manual | 75.0 | NaN | 20000 | 1 | NaN | volkswagen | no | 2016-04-04 | 0 | 44801 | 07/04/2016 00:15 | Y: too late | 2010_plus |
| 354106 | 01/04/2016 02:36 | 1000 | NaN | 2017.0 | manual | 54.0 | NaN | 125000 | 2 | NaN | hyundai | no | 2016-01-04 | 0 | 67547 | 05/04/2016 02:45 | Y: too late | 2010_plus |
7208 rows × 18 columns
def correct_registration_years_x(df, threshold=0.9, proximity=1):
    """
    Corrects registration years flagged as 'too early' or 'too late'.
    Adds ±proximity tolerance when determining majority years.
    Groups on (brand, model, power, vehicletype) and falls back to
    (brand, model, vehicletype); unlike correct_registration_years, year_bin is not used.
    """
    df = df.copy()
    # --- Split flagged vs correct ---
    flagged_mask = df['registration_correction'].isin(['Y: too early', 'Y: too late'])
    flagged = df[flagged_mask].copy()
    correct = df[~flagged_mask].copy()
    if flagged.empty:
        return df  # nothing to fix
    # --- Helper: Cluster nearby years (±proximity) ---
    def cluster_years(series, proximity=1):
        if series.empty:
            return np.nan, 0
        years = series.dropna().astype(int)
        if years.empty:
            return np.nan, 0
        clusters = []
        for y in sorted(years.unique()):
            found = False
            for cluster in clusters:
                if abs(cluster['years'][-1] - y) <= proximity:
                    cluster['years'].append(y)
                    cluster['count'] += (years == y).sum()
                    found = True
                    break
            if not found:
                clusters.append({'years': [y], 'count': (years == y).sum()})
        top_cluster = max(clusters, key=lambda c: c['count'])
        cluster_year = int(np.round(np.mean(top_cluster['years'])))
        freq = top_cluster['count'] / len(years)
        return cluster_year, freq
    # --- Compute majority year per group ---
    def get_majority_table(group_cols):
        rows = []
        for name, group in correct.groupby(group_cols):
            year, freq = cluster_years(group['registrationyear'], proximity)
            rows.append((*name, year, freq, group['registrationyear'].min(), group['registrationyear'].max()))
        return pd.DataFrame(rows, columns=group_cols + ['majority_year','mode_freq','min','max'])
    # Start with detailed grouping
    majority_df = get_majority_table(['brand','model','power','vehicletype'])
    flagged = flagged.merge(majority_df, on=['brand','model','power','vehicletype'], how='left')
    # --- Fallback using (brand, model, vehicletype) ---
    missing_mask = flagged['majority_year'].isna()
    if missing_mask.any():
        fallback = get_majority_table(['brand','model','vehicletype'])
        flagged = flagged.merge(
            fallback,
            on=['brand','model','vehicletype'],
            how='left',
            suffixes=('','_fallback')
        )
        # Fill in missing majority fields from fallback where possible
        for col in ['majority_year','mode_freq','min','max']:
            flagged[col] = flagged[col].fillna(flagged[f"{col}_fallback"])
        # Clean up helper columns
        flagged.drop(columns=[c for c in flagged.columns if c.endswith('_fallback')], inplace=True)
    # --- Apply corrections ---
    def fill_year(row):
        if pd.notna(row['mode_freq']) and row['mode_freq'] >= threshold and pd.notna(row['majority_year']):
            return row['majority_year']
        elif row['registration_correction'] == 'Y: too early' and pd.notna(row['min']):
            return row['min']
        elif row['registration_correction'] == 'Y: too late' and pd.notna(row['max']):
            return row['max']
        else:
            return row['registrationyear']
    def fill_flag(row):
        if pd.notna(row['mode_freq']) and row['mode_freq'] >= threshold:
            return 'N'
        else:
            return row['registration_correction']
    flagged['registrationyear'] = flagged.apply(fill_year, axis=1)
    flagged['registration_correction'] = flagged.apply(fill_flag, axis=1)
    # --- Cleanup helper cols ---
    flagged.drop(columns=['majority_year','mode_freq','min','max'], inplace=True)
    # --- Combine back safely ---
    result = pd.concat([correct, flagged], ignore_index=True)
    return result
df_years = correct_registration_years_x(df_reg_years, threshold = 0.7, proximity = 1)
df_years[(df_years['registration_correction'] == "Y: too early") | (df_years['registration_correction'] == "Y: too late")]
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 346901 | 16/03/2016 13:45 | 140 | small | 1986.0 | NaN | 0.0 | cayenne | 20000 | 0 | petrol | porsche | NaN | 2016-03-16 | 0 | 25860 | 17/03/2016 11:17 | Y: too early | before_1990 |
| 346904 | 23/03/2016 11:52 | 6900 | sedan | 1996.0 | manual | 105.0 | e_klasse | 150000 | 6 | petrol | mercedes_benz | no | 2016-03-23 | 0 | 86609 | 05/04/2016 11:18 | Y: too early | before_1990 |
| 346905 | 27/03/2016 12:52 | 7900 | sedan | 1996.0 | auto | 194.0 | e_klasse | 125000 | 9 | petrol | mercedes_benz | no | 2016-03-27 | 0 | 80337 | 07/04/2016 08:17 | Y: too early | before_1990 |
| 346909 | 14/03/2016 09:50 | 999 | coupe | 2007.0 | NaN | 0.0 | 1er | 150000 | 8 | NaN | bmw | no | 2016-03-14 | 0 | 76131 | 26/03/2016 02:46 | Y: too early | 1990s |
| 346914 | 31/03/2016 17:37 | 1000 | small | 1994.0 | manual | 0.0 | antara | 70000 | 9 | petrol | opel | no | 2016-03-31 | 0 | 16775 | 06/04/2016 10:45 | Y: too early | 1990s |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 354102 | 16/03/2016 17:51 | 2300 | NaN | 2017.0 | auto | 192.0 | NaN | 150000 | 0 | NaN | bmw | no | 2016-03-16 | 0 | 45896 | 17/03/2016 16:17 | Y: too late | 2010_plus |
| 354103 | 24/03/2016 16:54 | 900 | NaN | 2017.0 | manual | 101.0 | NaN | 150000 | 6 | NaN | opel | NaN | 2016-03-24 | 0 | 50170 | 07/04/2016 09:17 | Y: too late | 2010_plus |
| 354104 | 07/04/2016 08:36 | 1670 | NaN | 2018.0 | manual | 0.0 | NaN | 90000 | 0 | petrol | renault | no | 2016-07-04 | 0 | 12167 | 07/04/2016 08:36 | Y: too late | 2010_plus |
| 354105 | 04/04/2016 21:40 | 10980 | NaN | 2018.0 | manual | 75.0 | NaN | 20000 | 1 | NaN | volkswagen | no | 2016-04-04 | 0 | 44801 | 07/04/2016 00:15 | Y: too late | 2010_plus |
| 354106 | 01/04/2016 02:36 | 1000 | NaN | 2017.0 | manual | 54.0 | NaN | 125000 | 2 | NaN | hyundai | no | 2016-01-04 | 0 | 67547 | 05/04/2016 02:45 | Y: too late | 2010_plus |
5480 rows × 18 columns
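The `correct_registration_years*` variants in this notebook differ only in the columns they group on, so the same logic could be written once with the grouping levels passed in. The following is a rough sketch of that idea, not the notebook's implementation: `correct_years` and its `group_levels` parameter are hypothetical names, the row-wise `apply` calls are replaced with `Series.mask`, and the toy frame is invented for the example.

```python
import numpy as np
import pandas as pd

STATS = ['majority_year', 'mode_freq', 'min', 'max']

def cluster_years(years, proximity):
    """Largest chain of sorted years with gaps <= proximity -> (rounded mean year, share)."""
    counts = years.dropna().astype(int).value_counts().sort_index()
    if counts.empty:
        return np.nan, 0.0
    clusters = []  # each entry is [list_of_years, row_count]
    for y, n in counts.items():
        if clusters and y - clusters[-1][0][-1] <= proximity:
            clusters[-1][0].append(y)
            clusters[-1][1] += n
        else:
            clusters.append([[y], n])
    top_years, top_n = max(clusters, key=lambda c: c[1])
    return int(np.round(np.mean(top_years))), top_n / counts.sum()

def majority_table(correct, cols, proximity):
    rows = []
    for key, grp in correct.groupby(cols):
        key = key if isinstance(key, tuple) else (key,)
        year, freq = cluster_years(grp['registrationyear'], proximity)
        rows.append((*key, year, freq, grp['registrationyear'].min(), grp['registrationyear'].max()))
    return pd.DataFrame(rows, columns=list(cols) + STATS)

def correct_years(df, group_levels, threshold=0.9, proximity=1):
    """group_levels (hypothetical): list of column lists, most detailed grouping first."""
    flagged_mask = df['registration_correction'].isin(['Y: too early', 'Y: too late'])
    flagged, correct = df[flagged_mask].copy(), df[~flagged_mask].copy()
    if flagged.empty:
        return df.copy()
    for col in STATS:
        flagged[col] = np.nan
    for cols in group_levels:
        # Later (coarser) levels only fill statistics that are still missing.
        merged = flagged[cols].merge(majority_table(correct, cols, proximity), on=cols, how='left')
        for col in STATS:
            flagged[col] = flagged[col].fillna(pd.Series(merged[col].values, index=flagged.index))
    confident = flagged['mode_freq'] >= threshold  # NaN compares as False
    use_major = confident & flagged['majority_year'].notna()
    early = flagged['registration_correction'].eq('Y: too early') & flagged['min'].notna()
    late = flagged['registration_correction'].eq('Y: too late') & flagged['max'].notna()
    year = flagged['registrationyear']
    year = year.mask(~use_major & late, flagged['max'])
    year = year.mask(~use_major & early, flagged['min'])
    year = year.mask(use_major, flagged['majority_year'])
    flagged['registrationyear'] = year
    flagged.loc[confident, 'registration_correction'] = 'N'
    return pd.concat([correct, flagged.drop(columns=STATS)], ignore_index=True)

# Made-up example: one flagged row matches the detailed level, one needs the fallback.
toy = pd.DataFrame({
    'brand': ['vw'] * 6,
    'model': ['golf'] * 6,
    'power': [75.0, 75.0, 90.0, 90.0, 75.0, 0.0],
    'registrationyear': [2000.0, 2000.0, 2001.0, 2001.0, 1910.0, 2018.0],
    'registration_correction': ['N', 'N', 'N', 'N', 'Y: too early', 'Y: too late'],
})
fixed = correct_years(toy, [['brand', 'model', 'power'], ['brand', 'model']],
                      threshold=0.7, proximity=0)
print(fixed[['power', 'registrationyear', 'registration_correction']])
```

In the toy run, the row with `power == 75` is corrected to the majority year and its flag is cleared. The row with `power == 0` only matches the fallback level, where no cluster reaches the threshold, so it is clamped to the group maximum and stays flagged, mirroring the `fill_year` and `fill_flag` branches.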
del df_reg_years
df_years1 = correct_registration_years_x(df_years, threshold = 0.7, proximity = 3)
df_years1[(df_years1['registration_correction'] == "Y: too early") | (df_years1['registration_correction'] == "Y: too late")]
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 348627 | 16/03/2016 13:45 | 140 | small | 1986.0 | NaN | 0.0 | cayenne | 20000 | 0 | petrol | porsche | NaN | 2016-03-16 | 0 | 25860 | 17/03/2016 11:17 | Y: too early | before_1990 |
| 348628 | 23/03/2016 11:52 | 6900 | sedan | 1996.0 | manual | 105.0 | e_klasse | 150000 | 6 | petrol | mercedes_benz | no | 2016-03-23 | 0 | 86609 | 05/04/2016 11:18 | Y: too early | before_1990 |
| 348629 | 27/03/2016 12:52 | 7900 | sedan | 1996.0 | auto | 194.0 | e_klasse | 125000 | 9 | petrol | mercedes_benz | no | 2016-03-27 | 0 | 80337 | 07/04/2016 08:17 | Y: too early | before_1990 |
| 348631 | 31/03/2016 17:37 | 1000 | small | 1994.0 | manual | 0.0 | antara | 70000 | 9 | petrol | opel | no | 2016-03-31 | 0 | 16775 | 06/04/2016 10:45 | Y: too early | 1990s |
| 348632 | 29/03/2016 11:53 | 3500 | sedan | 1998.0 | auto | 185.0 | e_klasse | 150000 | 0 | petrol | mercedes_benz | no | 2016-03-29 | 0 | 15328 | 05/04/2016 21:45 | Y: too early | before_1990 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 354102 | 16/03/2016 17:51 | 2300 | NaN | 2017.0 | auto | 192.0 | NaN | 150000 | 0 | NaN | bmw | no | 2016-03-16 | 0 | 45896 | 17/03/2016 16:17 | Y: too late | 2010_plus |
| 354103 | 24/03/2016 16:54 | 900 | NaN | 2017.0 | manual | 101.0 | NaN | 150000 | 6 | NaN | opel | NaN | 2016-03-24 | 0 | 50170 | 07/04/2016 09:17 | Y: too late | 2010_plus |
| 354104 | 07/04/2016 08:36 | 1670 | NaN | 2018.0 | manual | 0.0 | NaN | 90000 | 0 | petrol | renault | no | 2016-07-04 | 0 | 12167 | 07/04/2016 08:36 | Y: too late | 2010_plus |
| 354105 | 04/04/2016 21:40 | 10980 | NaN | 2018.0 | manual | 75.0 | NaN | 20000 | 1 | NaN | volkswagen | no | 2016-04-04 | 0 | 44801 | 07/04/2016 00:15 | Y: too late | 2010_plus |
| 354106 | 01/04/2016 02:36 | 1000 | NaN | 2017.0 | manual | 54.0 | NaN | 125000 | 2 | NaN | hyundai | no | 2016-01-04 | 0 | 67547 | 05/04/2016 02:45 | Y: too late | 2010_plus |
5372 rows × 18 columns
del df_years
fix80 = (df_years1['registrationyear'] < 1990) & (df_years1['year_bin'] != 'before_1990')
df_years1.loc[fix80,['year_bin']] = 'before_1990'
fix90 = (df_years1['registrationyear'] > 1989) & (df_years1['registrationyear'] < 2000) & (df_years1['year_bin'] != '1990s')
df_years1.loc[fix90,['year_bin']] = '1990s'
fix00 = (df_years1['registrationyear'] > 1999) & (df_years1['registrationyear'] < 2010) & (df_years1['year_bin'] != '2000s')
df_years1.loc[fix00,['year_bin']] = '2000s'
fix10 = (df_years1['registrationyear'] > 2009) & (df_years1['year_bin'] != '2010_plus')
df_years1.loc[fix10,['year_bin']] = '2010_plus'
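The four masks above re-derive `year_bin` from the corrected `registrationyear`. As an alternative sketch, the same decade boundaries can be expressed in a single `pd.cut` call; the `years` series below is made-up sample data, and the bin edges are chosen to reproduce the comparisons used in the masks.

```python
import numpy as np
import pandas as pd

# Made-up years covering each boundary used by the masks.
years = pd.Series([1964.0, 1989.0, 1990.0, 1999.0, 2000.0, 2009.0, 2010.0, 2018.0])

year_bin = pd.cut(
    years,
    bins=[-np.inf, 1989, 1999, 2009, np.inf],  # intervals are right-inclusive by default
    labels=['before_1990', '1990s', '2000s', '2010_plus'],
).astype(object)

print(pd.DataFrame({'registrationyear': years, 'year_bin': year_bin}))
```

One difference to be aware of: `pd.cut` returns NaN for a missing year, whereas the mask-based version leaves the existing `year_bin` value in place for such rows.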
trabantfix = (df_years1['brand'] == 'trabant') & (df_years1['model'] == 'other') & (df_years1['registration_correction'] != 'N') & (df_years1['registrationyear'] == 1964)
df_years1.loc[trabantfix, ['registration_correction']] = 'N'
citroenfix = (df_years1['brand'] == 'citroen') & (df_years1['model'] == 'other') & (df_years1['registration_correction'] == 'Y: too early') & (df_years1['registrationyear'] == 1934)
df_years1.loc[citroenfix, ['registration_correction']] = 'N'
df_reg = correct_registration_years(df_years1, threshold = 0.7, proximity = 5)
df_reg[(df_reg['registration_correction'] == "Y: too early") | (df_reg['registration_correction'] == "Y: too late")]
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 348737 | 16/03/2016 13:45 | 140 | small | 1986.0 | NaN | 0.0 | cayenne | 20000 | 0 | petrol | porsche | NaN | 2016-03-16 | 0 | 25860 | 17/03/2016 11:17 | Y: too early | before_1990 |
| 348740 | 31/03/2016 17:37 | 1000 | small | 1994.0 | manual | 0.0 | antara | 70000 | 9 | petrol | opel | no | 2016-03-31 | 0 | 16775 | 06/04/2016 10:45 | Y: too early | 1990s |
| 348742 | 21/03/2016 02:00 | 3500 | small | 1992.0 | NaN | 0.0 | e_klasse | 150000 | 1 | NaN | mercedes_benz | no | 2016-03-21 | 0 | 68799 | 06/04/2016 00:44 | Y: too early | 1990s |
| 348743 | 20/03/2016 19:27 | 12500 | suv | 2005.0 | auto | 296.0 | range_rover_evoque | 150000 | 8 | gasoline | land_rover | no | 2016-03-20 | 0 | 61462 | 20/03/2016 20:38 | Y: too early | 2000s |
| 348745 | 10/03/2016 23:44 | 3490 | wagon | 2007.0 | manual | 101.0 | calibra | 150000 | 12 | gasoline | opel | no | 2016-10-03 | 0 | 66953 | 11/03/2016 12:17 | Y: too late | 2000s |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 354102 | 16/03/2016 17:51 | 2300 | NaN | 2017.0 | auto | 192.0 | NaN | 150000 | 0 | NaN | bmw | no | 2016-03-16 | 0 | 45896 | 17/03/2016 16:17 | Y: too late | 2010_plus |
| 354103 | 24/03/2016 16:54 | 900 | NaN | 2017.0 | manual | 101.0 | NaN | 150000 | 6 | NaN | opel | NaN | 2016-03-24 | 0 | 50170 | 07/04/2016 09:17 | Y: too late | 2010_plus |
| 354104 | 07/04/2016 08:36 | 1670 | NaN | 2018.0 | manual | 0.0 | NaN | 90000 | 0 | petrol | renault | no | 2016-07-04 | 0 | 12167 | 07/04/2016 08:36 | Y: too late | 2010_plus |
| 354105 | 04/04/2016 21:40 | 10980 | NaN | 2018.0 | manual | 75.0 | NaN | 20000 | 1 | NaN | volkswagen | no | 2016-04-04 | 0 | 44801 | 07/04/2016 00:15 | Y: too late | 2010_plus |
| 354106 | 01/04/2016 02:36 | 1000 | NaN | 2017.0 | manual | 54.0 | NaN | 125000 | 2 | NaN | hyundai | no | 2016-01-04 | 0 | 67547 | 05/04/2016 02:45 | Y: too late | 2010_plus |
5327 rows × 18 columns
del df_years1
df_app = fill_all_missing_values(df_reg, threshold=0.7, repeat_until_no_change=True)
🔁 Pass 1 (threshold=70%)
▶ Running Vehicle Type fill function...
▶ Running Model fill function...
✅ Filled 57 missing models (threshold=70%)
▶ Running Fuel Type fill function...
▶ Running Power (0 HP) fill function...
Missing counts after pass 1:
vehicletype: 15,376
model: 13,391
fueltype: 14,451
power==0: 34,829
🔁 Pass 2 (threshold=70%)
▶ Running Vehicle Type fill function...
▶ Running Model fill function...
✅ Filled 7 missing models (threshold=70%)
▶ Running Fuel Type fill function...
▶ Running Power (0 HP) fill function...
Missing counts after pass 2:
vehicletype: 15,346
model: 13,384
fueltype: 14,379
power==0: 34,827
🔁 Pass 3 (threshold=70%)
▶ Running Vehicle Type fill function...
▶ Running Model fill function...
✅ Filled 2 missing models (threshold=70%)
▶ Running Fuel Type fill function...
▶ Running Power (0 HP) fill function...
Missing counts after pass 3:
vehicletype: 15,346
model: 13,382
fueltype: 14,371
power==0: 34,827
🔁 Pass 4 (threshold=70%)
▶ Running Vehicle Type fill function...
▶ Running Model fill function...
✅ Filled 0 missing models (threshold=70%)
▶ Running Fuel Type fill function...
▶ Running Power (0 HP) fill function...
Missing counts after pass 4:
vehicletype: 15,346
model: 13,382
fueltype: 14,371
power==0: 34,827
✅ No further fills detected — stopping.
🏁 All fill functions completed.
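The log above shows the fill functions being re-run until a pass leaves the missing counts unchanged. A minimal sketch of that "repeat until no change" pattern follows. It is not the notebook's `fill_all_missing_values`: both `fill_model_by_majority` and `repeat_until_stable` are hypothetical stand-ins, and the toy data is invented.

```python
import numpy as np
import pandas as pd

def fill_model_by_majority(df, threshold):
    """Hypothetical fill step: use a brand's dominant model when its share >= threshold."""
    out = df.copy()
    for brand, grp in out.groupby('brand'):
        shares = grp['model'].value_counts(normalize=True)  # NaN is excluded
        if not shares.empty and shares.iloc[0] >= threshold:
            mask = (out['brand'] == brand) & out['model'].isna()
            out.loc[mask, 'model'] = shares.index[0]
    return out

def repeat_until_stable(df, steps, threshold, max_passes=10):
    """Run all steps repeatedly; stop once a pass no longer reduces the missing count."""
    passes = 0
    while passes < max_passes:
        before = int(df.isna().sum().sum())
        for step in steps:
            df = step(df, threshold)
        passes += 1
        if int(df.isna().sum().sum()) == before:
            break
    return df, passes

toy = pd.DataFrame({
    'brand': ['ford', 'ford', 'ford', 'ford', 'opel', 'opel', 'bmw', 'bmw', 'bmw'],
    'model': ['ka', 'ka', 'ka', np.nan, 'astra', np.nan, '3er', '5er', np.nan],
})
filled, passes = repeat_until_stable(toy, [fill_model_by_majority], threshold=0.7)
print(passes, filled['model'].isna().sum())
```

As in the log, the last pass does no work: it exists only to confirm that nothing more can be filled. The `max_passes` cap is an added safeguard in this sketch, in case a step keeps changing values without converging.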
df_reg = correct_registration_years(df_app, threshold = 0.7, proximity = 10)
df_reg[(df_reg['registration_correction'] == "Y: too early") | (df_reg['registration_correction'] == "Y: too late")]
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 348780 | 16/03/2016 13:45 | 140 | small | 1986.0 | NaN | 0.0 | cayenne | 20000 | 0 | petrol | porsche | NaN | 2016-03-16 | 0 | 25860 | 17/03/2016 11:17 | Y: too early | before_1990 |
| 348781 | 31/03/2016 17:37 | 1000 | small | 1994.0 | manual | 0.0 | antara | 70000 | 9 | petrol | opel | no | 2016-03-31 | 0 | 16775 | 06/04/2016 10:45 | Y: too early | 1990s |
| 348782 | 21/03/2016 02:00 | 3500 | small | 1992.0 | NaN | 0.0 | e_klasse | 150000 | 1 | NaN | mercedes_benz | no | 2016-03-21 | 0 | 68799 | 06/04/2016 00:44 | Y: too early | 1990s |
| 348783 | 20/03/2016 19:27 | 12500 | suv | 2005.0 | auto | 296.0 | range_rover_evoque | 150000 | 8 | gasoline | land_rover | no | 2016-03-20 | 0 | 61462 | 20/03/2016 20:38 | Y: too early | 2000s |
| 348784 | 10/03/2016 23:44 | 3490 | wagon | 2007.0 | manual | 101.0 | calibra | 150000 | 12 | gasoline | opel | no | 2016-10-03 | 0 | 66953 | 11/03/2016 12:17 | Y: too late | 2000s |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 354102 | 16/03/2016 17:51 | 2300 | NaN | 2017.0 | auto | 192.0 | NaN | 150000 | 0 | NaN | bmw | no | 2016-03-16 | 0 | 45896 | 17/03/2016 16:17 | Y: too late | 2010_plus |
| 354103 | 24/03/2016 16:54 | 900 | NaN | 2017.0 | manual | 101.0 | NaN | 150000 | 6 | NaN | opel | NaN | 2016-03-24 | 0 | 50170 | 07/04/2016 09:17 | Y: too late | 2010_plus |
| 354104 | 07/04/2016 08:36 | 1670 | NaN | 2018.0 | manual | 0.0 | NaN | 90000 | 0 | petrol | renault | no | 2016-07-04 | 0 | 12167 | 07/04/2016 08:36 | Y: too late | 2010_plus |
| 354105 | 04/04/2016 21:40 | 10980 | NaN | 2018.0 | manual | 75.0 | NaN | 20000 | 1 | NaN | volkswagen | no | 2016-04-04 | 0 | 44801 | 07/04/2016 00:15 | Y: too late | 2010_plus |
| 354106 | 01/04/2016 02:36 | 1000 | NaN | 2017.0 | manual | 54.0 | NaN | 125000 | 2 | NaN | hyundai | no | 2016-01-04 | 0 | 67547 | 05/04/2016 02:45 | Y: too late | 2010_plus |
5324 rows × 18 columns
del df_app
def correct_registration_years1(df, threshold=0.9, proximity=1):
    """
    Corrects registration years flagged as 'too early' or 'too late'.
    Adds ±proximity tolerance when determining majority years.
    Groups on (brand, model, year_bin, power) and falls back to
    (brand, model, year_bin); vehicletype is not used.
    """
    df = df.copy()
    # --- Split flagged vs correct ---
    flagged_mask = df['registration_correction'].isin(['Y: too early', 'Y: too late'])
    flagged = df[flagged_mask].copy()
    correct = df[~flagged_mask].copy()
    if flagged.empty:
        return df  # nothing to fix
    # --- Helper: Cluster nearby years (±proximity) ---
    def cluster_years(series, proximity=1):
        if series.empty:
            return np.nan, 0
        years = series.dropna().astype(int)
        if years.empty:
            return np.nan, 0
        clusters = []
        for y in sorted(years.unique()):
            found = False
            for cluster in clusters:
                if abs(cluster['years'][-1] - y) <= proximity:
                    cluster['years'].append(y)
                    cluster['count'] += (years == y).sum()
                    found = True
                    break
            if not found:
                clusters.append({'years': [y], 'count': (years == y).sum()})
        top_cluster = max(clusters, key=lambda c: c['count'])
        cluster_year = int(np.round(np.mean(top_cluster['years'])))
        freq = top_cluster['count'] / len(years)
        return cluster_year, freq
    # --- Compute majority year per group ---
    def get_majority_table(group_cols):
        rows = []
        for name, group in correct.groupby(group_cols):
            year, freq = cluster_years(group['registrationyear'], proximity)
            rows.append((*name, year, freq, group['registrationyear'].min(), group['registrationyear'].max()))
        return pd.DataFrame(rows, columns=group_cols + ['majority_year','mode_freq','min','max'])
    # Start with detailed grouping
    majority_df = get_majority_table(['brand','model','year_bin','power'])
    flagged = flagged.merge(majority_df, on=['brand','model','year_bin','power'], how='left')
    # --- Fallback using (brand, model, year_bin) ---
    missing_mask = flagged['majority_year'].isna()
    if missing_mask.any():
        fallback = get_majority_table(['brand','model','year_bin'])
        flagged = flagged.merge(
            fallback,
            on=['brand','model','year_bin'],
            how='left',
            suffixes=('','_fallback')
        )
        # Fill in missing majority fields from fallback where possible
        for col in ['majority_year','mode_freq','min','max']:
            flagged[col] = flagged[col].fillna(flagged[f"{col}_fallback"])
        # Clean up helper columns
        flagged.drop(columns=[c for c in flagged.columns if c.endswith('_fallback')], inplace=True)
    # --- Apply corrections ---
    def fill_year(row):
        if pd.notna(row['mode_freq']) and row['mode_freq'] >= threshold and pd.notna(row['majority_year']):
            return row['majority_year']
        elif row['registration_correction'] == 'Y: too early' and pd.notna(row['min']):
            return row['min']
        elif row['registration_correction'] == 'Y: too late' and pd.notna(row['max']):
            return row['max']
        else:
            return row['registrationyear']
    def fill_flag(row):
        if pd.notna(row['mode_freq']) and row['mode_freq'] >= threshold:
            return 'N'
        else:
            return row['registration_correction']
    flagged['registrationyear'] = flagged.apply(fill_year, axis=1)
    flagged['registration_correction'] = flagged.apply(fill_flag, axis=1)
    # --- Cleanup helper cols ---
    flagged.drop(columns=['majority_year','mode_freq','min','max'], inplace=True)
    # --- Combine back safely ---
    result = pd.concat([correct, flagged], ignore_index=True)
    return result
df_reg1 = correct_registration_years1(df_reg, threshold = 0.7, proximity = 5)
df_reg1[(df_reg1['registration_correction'] == "Y: too early") | (df_reg1['registration_correction'] == "Y: too late")]
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 348783 | 16/03/2016 13:45 | 140 | small | 1986.0 | NaN | 0.0 | cayenne | 20000 | 0 | petrol | porsche | NaN | 2016-03-16 | 0 | 25860 | 17/03/2016 11:17 | Y: too early | before_1990 |
| 348784 | 31/03/2016 17:37 | 1000 | small | 1994.0 | manual | 0.0 | antara | 70000 | 9 | petrol | opel | no | 2016-03-31 | 0 | 16775 | 06/04/2016 10:45 | Y: too early | 1990s |
| 348786 | 20/03/2016 19:27 | 12500 | suv | 2005.0 | auto | 296.0 | range_rover_evoque | 150000 | 8 | gasoline | land_rover | no | 2016-03-20 | 0 | 61462 | 20/03/2016 20:38 | Y: too early | 2000s |
| 348787 | 10/03/2016 23:44 | 3490 | wagon | 2007.0 | manual | 101.0 | calibra | 150000 | 12 | gasoline | opel | no | 2016-10-03 | 0 | 66953 | 11/03/2016 12:17 | Y: too late | 2000s |
| 348788 | 21/03/2016 12:51 | 1400 | coupe | 1999.0 | auto | 196.0 | glk | 150000 | 7 | petrol | mercedes_benz | no | 2016-03-21 | 0 | 47441 | 21/03/2016 12:51 | Y: too early | 1990s |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 354102 | 16/03/2016 17:51 | 2300 | NaN | 2017.0 | auto | 192.0 | NaN | 150000 | 0 | NaN | bmw | no | 2016-03-16 | 0 | 45896 | 17/03/2016 16:17 | Y: too late | 2010_plus |
| 354103 | 24/03/2016 16:54 | 900 | NaN | 2017.0 | manual | 101.0 | NaN | 150000 | 6 | NaN | opel | NaN | 2016-03-24 | 0 | 50170 | 07/04/2016 09:17 | Y: too late | 2010_plus |
| 354104 | 07/04/2016 08:36 | 1670 | NaN | 2018.0 | manual | 0.0 | NaN | 90000 | 0 | petrol | renault | no | 2016-07-04 | 0 | 12167 | 07/04/2016 08:36 | Y: too late | 2010_plus |
| 354105 | 04/04/2016 21:40 | 10980 | NaN | 2018.0 | manual | 75.0 | NaN | 20000 | 1 | NaN | volkswagen | no | 2016-04-04 | 0 | 44801 | 07/04/2016 00:15 | Y: too late | 2010_plus |
| 354106 | 01/04/2016 02:36 | 1000 | NaN | 2017.0 | manual | 54.0 | NaN | 125000 | 2 | NaN | hyundai | no | 2016-01-04 | 0 | 67547 | 05/04/2016 02:45 | Y: too late | 2010_plus |
2524 rows × 18 columns
def correct_registration_years2(df, threshold=0.9, proximity=1):
    """
    Corrects registration years flagged as 'too early' or 'too late'.
    Adds ±proximity tolerance when determining majority years.
    Groups on (brand, model, year_bin) and falls back to (brand, year_bin);
    power and vehicletype are not used.
    """
    df = df.copy()
    # --- Split flagged vs correct ---
    flagged_mask = df['registration_correction'].isin(['Y: too early', 'Y: too late'])
    flagged = df[flagged_mask].copy()
    correct = df[~flagged_mask].copy()
    if flagged.empty:
        return df  # nothing to fix
    # --- Helper: Cluster nearby years (±proximity) ---
    def cluster_years(series, proximity=1):
        if series.empty:
            return np.nan, 0
        years = series.dropna().astype(int)
        if years.empty:
            return np.nan, 0
        clusters = []
        for y in sorted(years.unique()):
            found = False
            for cluster in clusters:
                if abs(cluster['years'][-1] - y) <= proximity:
                    cluster['years'].append(y)
                    cluster['count'] += (years == y).sum()
                    found = True
                    break
            if not found:
                clusters.append({'years': [y], 'count': (years == y).sum()})
        top_cluster = max(clusters, key=lambda c: c['count'])
        cluster_year = int(np.round(np.mean(top_cluster['years'])))
        freq = top_cluster['count'] / len(years)
        return cluster_year, freq
    # --- Compute majority year per group ---
    def get_majority_table(group_cols):
        rows = []
        for name, group in correct.groupby(group_cols):
            year, freq = cluster_years(group['registrationyear'], proximity)
            rows.append((*name, year, freq, group['registrationyear'].min(), group['registrationyear'].max()))
        return pd.DataFrame(rows, columns=group_cols + ['majority_year','mode_freq','min','max'])
    # Start with detailed grouping
    majority_df = get_majority_table(['brand','model','year_bin'])
    flagged = flagged.merge(majority_df, on=['brand','model','year_bin'], how='left')
    # --- Fallback using (brand, year_bin) ---
    missing_mask = flagged['majority_year'].isna()
    if missing_mask.any():
        fallback = get_majority_table(['brand','year_bin'])
        flagged = flagged.merge(
            fallback,
            on=['brand','year_bin'],
            how='left',
            suffixes=('','_fallback')
        )
        # Fill in missing majority fields from fallback where possible
        for col in ['majority_year','mode_freq','min','max']:
            flagged[col] = flagged[col].fillna(flagged[f"{col}_fallback"])
        # Clean up helper columns
        flagged.drop(columns=[c for c in flagged.columns if c.endswith('_fallback')], inplace=True)
    # --- Apply corrections ---
    def fill_year(row):
        if pd.notna(row['mode_freq']) and row['mode_freq'] >= threshold and pd.notna(row['majority_year']):
            return row['majority_year']
        elif row['registration_correction'] == 'Y: too early' and pd.notna(row['min']):
            return row['min']
        elif row['registration_correction'] == 'Y: too late' and pd.notna(row['max']):
            return row['max']
        else:
            return row['registrationyear']
    def fill_flag(row):
        if pd.notna(row['mode_freq']) and row['mode_freq'] >= threshold:
            return 'N'
        else:
            return row['registration_correction']
    flagged['registrationyear'] = flagged.apply(fill_year, axis=1)
    flagged['registration_correction'] = flagged.apply(fill_flag, axis=1)
    # --- Cleanup helper cols ---
    flagged.drop(columns=['majority_year','mode_freq','min','max'], inplace=True)
    # --- Combine back safely ---
    result = pd.concat([correct, flagged], ignore_index=True)
    return result
df_reg2 = correct_registration_years2(df_reg1, threshold = 0.7, proximity = 5)
df_reg2[(df_reg2['registration_correction'] == "Y: too early") | (df_reg2['registration_correction'] == "Y: too late")].head()
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 351589 | 04/04/2016 20:57 | 14500 | suv | 2015.0 | manual | 26.0 | 601 | 40000 | 4 | petrol | trabant | no | 2016-04-04 | 0 | 98704 | 06/04/2016 23:44 | Y: too late | 2010_plus |
| 351633 | 27/03/2016 13:46 | 2300 | suv | 2017.0 | manual | 26.0 | 601 | 70000 | 1 | other | trabant | no | 2016-03-27 | 0 | 39443 | 07/04/2016 09:45 | Y: too late | 2010_plus |
| 351659 | 26/03/2016 13:46 | 2190 | suv | 2017.0 | manual | 0.0 | 601 | 50000 | 1 | petrol | trabant | NaN | 2016-03-26 | 0 | 98617 | 06/04/2016 01:44 | Y: too late | 2010_plus |
| 351717 | 21/03/2016 20:56 | 1900 | suv | 2016.0 | NaN | 26.0 | 601 | 30000 | 6 | petrol | trabant | NaN | 2016-03-21 | 0 | 16259 | 07/04/2016 00:44 | Y: too late | 2010_plus |
| 351749 | 22/03/2016 18:46 | 150 | NaN | 2015.0 | NaN | 0.0 | other | 80000 | 0 | NaN | trabant | NaN | 2016-03-22 | 0 | 39340 | 22/03/2016 18:46 | Y: too late | 2010_plus |
del df_reg1
df_reg2[(df_reg2['model'] == '601') & (df_reg2['registration_correction'] == "N")].median()
price                 900.0
registrationyear     1987.0
power                  26.0
model                 601.0
mileage             50000.0
registrationmonth       2.0
numberofpictures        0.0
postalcode          16759.5
dtype: float64
tra601 = (df_reg2['model'] == '601') & (df_reg2['registration_correction'] != "N")
df_reg2.loc[tra601,['registrationyear']] = 1987
df_reg2.loc[tra601,['registration_correction']] = "N"
df_reg2[(df_reg2['brand'] == 'trabant') & (df_reg2['registration_correction'] == "N")].median()
price                 945.0
registrationyear     1987.0
power                  26.0
mileage             50000.0
registrationmonth       1.0
numberofpictures        0.0
postalcode          16547.0
dtype: float64
tratra = (df_reg2['brand'] == 'trabant') & (df_reg2['registration_correction'] != "N")
df_reg2.loc[tratra,['registrationyear']] = 1987
df_reg2.loc[tratra,['registration_correction']] = "N"
df_reg2[(df_reg2['brand'] == 'rover') & (df_reg2['registration_correction'] == "N")].median()
price                  949.0
registrationyear      1999.0
power                  103.0
mileage             150000.0
registrationmonth        6.0
numberofpictures         0.0
postalcode           45881.0
dtype: float64
rover = (df_reg2['brand'] == 'rover') & (df_reg2['registration_correction'] != "N")
df_reg2.loc[rover,['registrationyear']] = 1999
df_reg2.loc[rover,['registration_correction']] = "N"
df_reg2[(df_reg2['brand'] == 'hyundai') & (df_reg2['registration_correction'] == "N")].median()
price                 3850.0
registrationyear      2007.0
power                  102.0
mileage             125000.0
registrationmonth        6.0
numberofpictures         0.0
postalcode           49586.0
dtype: float64
hyun = (df_reg2['brand'] == 'hyundai') & (df_reg2['registration_correction'] != "N")
df_reg2.loc[hyun,['registrationyear']] = 2007
df_reg2.loc[hyun,['registration_correction']] = "N"
df_reg2[(df_reg2['brand'] == 'audi') & (df_reg2['model'] == 'q7') & (df_reg2['registration_correction'] == "N")].median()
price                15500.0
registrationyear      2007.0
power                  233.0
mileage             150000.0
registrationmonth        8.0
numberofpictures         0.0
postalcode           46485.0
dtype: float64
aq7 = (df_reg2['brand'] == 'audi') & (df_reg2['model'] == 'q7') & (df_reg2['registration_correction'] != "N")
df_reg2.loc[aq7,['registrationyear']] = 2007
df_reg2.loc[aq7,['registration_correction']] = "N"
fix80 = (df_reg2['registrationyear'] < 1990) & (df_reg2['year_bin'] != 'before_1990')
df_reg2.loc[fix80,['year_bin']] = 'before_1990'
fix90 = (df_reg2['registrationyear'] > 1989) & (df_reg2['registrationyear'] < 2000) & (df_reg2['year_bin'] != '1990s')
df_reg2.loc[fix90,['year_bin']] = '1990s'
fix00 = (df_reg2['registrationyear'] > 1999) & (df_reg2['registrationyear'] < 2010) & (df_reg2['year_bin'] != '2000s')
df_reg2.loc[fix00,['year_bin']] = '2000s'
fix10 = (df_reg2['registrationyear'] > 2009) & (df_reg2['year_bin'] != '2010_plus')
df_reg2.loc[fix10,['year_bin']] = '2010_plus'
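The four masked assignments above re-derive `year_bin` from the corrected `registrationyear`. As a hedged alternative, the same mapping can be sketched in one step with `pd.cut`. The helper name `assign_year_bin` and the toy years are assumptions made for illustration; only the bin edges and labels come from the cells above.

```python
import numpy as np
import pandas as pd

def assign_year_bin(years):
    """Map registration years to the four decade labels used in this notebook."""
    # Edges mirror the masks above: <1990, 1990-1999, 2000-2009, >=2010
    bins = [-np.inf, 1989, 1999, 2009, np.inf]
    labels = ['before_1990', '1990s', '2000s', '2010_plus']
    # With the default right=True each interval includes its upper edge, e.g. (1989, 1999]
    return pd.cut(years, bins=bins, labels=labels).astype(object)

# Toy input (illustrative values, not taken from the dataset)
toy_years = pd.Series([1987.0, 1990.0, 1999.0, 2007.0, 2010.0])
print(assign_year_bin(toy_years).tolist())
```

On the real frame this would read `df_reg2['year_bin'] = assign_year_bin(df_reg2['registrationyear'])`, assuming every `year_bin` value should follow from the year alone.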
df_app3 = fill_missing_vehicle_type(df_reg2, threshold = 0.7)
df_app3.isna().sum()
datecrawled                    0
price                          0
vehicletype                15345
registrationyear               0
gearbox                    19830
power                          0
model                      13382
mileage                        0
registrationmonth              0
fueltype                   14371
brand                          0
notrepaired                71145
datecreated                    0
numberofpictures               0
postalcode                     0
lastseen                       0
registration_correction        0
year_bin                       0
dtype: int64
del df_reg2
too_high_hp = (df_app3['power'] > 999)
df_app3.loc[too_high_hp,['power']] = 0
hp_toohigh = (df_app3['power'] > 621) & (df_app3['model'] != 'other') & (df_app3['model'] != '5er')
df_app3.loc[hp_toohigh,['power']] = 0
hp_high = (df_app3['power'] > 450) & (~(df_app3['brand'].isin(['mercedes_benz','audi','bmw','porsche','ford']))) & (df_app3['model'] != 'other')
df_app3.loc[hp_high,['power']] = 0
vwgolfhigh = (df_app3['brand'] == 'volkswagen') & (df_app3['model'] == 'golf') & (df_app3['power'] > 306)
df_app3.loc[vwgolfhigh,['power']] = 0
polohighe = (df_app3['brand'] == 'volkswagen') & (df_app3['model'] == 'polo') & (df_app3['power'] > 200)
df_app3.loc[polohighe,['power']] = 0
passathigh = (df_app3['brand'] == 'volkswagen') & (df_app3['model'] == 'passat') & (df_app3['power'] > 300)
df_app3.loc[passathigh,['power']] = 0
jagxhigh = (df_app3['brand'] == 'jaguar') & (df_app3['model'] == 'x_type') & (df_app3['power'] > 240)
df_app3.loc[jagxhigh,['power']] = 0
captivahigh = (df_app3['brand'] == 'chevrolet') & (df_app3['model'] == 'captiva') & (df_app3['power'] > 258)
df_app3.loc[captivahigh,['power']] = 0
vwhigh = (df_app3['brand'] == 'volkswagen') & (df_app3['power'] > 420)
df_app3.loc[vwhigh,['power']] = 0
citroenhigh = (df_app3['brand'] == 'citroen') & (df_app3['power'] > 241)
df_app3.loc[citroenhigh,['power']] = 0
chryslerhigh = (df_app3['brand'] == 'chrysler') & (df_app3['power'] > 470)
df_app3.loc[chryslerhigh,['power']] = 0
fiathigh = (df_app3['brand'] == 'fiat') & (df_app3['power'] > 220)
df_app3.loc[fiathigh,['power']] = 0
suzukihigh = (df_app3['brand'] == 'suzuki') & (df_app3['power'] > 290)
df_app3.loc[suzukihigh,['power']] = 0
arhigh = (df_app3['brand'] == 'alfa_romeo') & (df_app3['power'] > 505)
df_app3.loc[arhigh,['power']] = 0
fordhigh = (df_app3['brand'] == 'ford') & (df_app3['power'] > 760)
df_app3.loc[fordhigh,['power']] = 0
chevyhigh = (df_app3['brand'] == 'chevrolet') & (df_app3['power'] > 650)
df_app3.loc[chevyhigh,['power']] = 0
hyundaihigh = (df_app3['brand'] == 'hyundai') & (df_app3['power'] > 370)
df_app3.loc[hyundaihigh,['power']] = 0
mitsubishihigh = (df_app3['brand'] == 'mitsubishi') & (df_app3['power'] > 440)
df_app3.loc[mitsubishihigh,['power']] = 0
nissanhigh = (df_app3['brand'] == 'nissan') & (df_app3['power'] > 600)
df_app3.loc[nissanhigh,['power']] = 0
opelhigh = (df_app3['brand'] == 'opel') & (df_app3['power'] > 577)
df_app3.loc[opelhigh,['power']] = 0
pehigh = (df_app3['brand'] == 'peugeot') & (df_app3['power'] > 360)
df_app3.loc[pehigh,['power']] = 0
seathigh = (df_app3['brand'] == 'seat') & (df_app3['power'] > 340)
df_app3.loc[seathigh,['power']] = 0
volvohigh = (df_app3['brand'] == 'volvo') & (df_app3['power'] > 510)
df_app3.loc[volvohigh,['power']] = 0
smarthigh = (df_app3['brand'] == 'smart') & (df_app3['power'] > 422)
df_app3.loc[smarthigh,['power']] = 0
del (too_high_hp, hp_toohigh, hp_high, vwgolfhigh, polohighe, passathigh,
     jagxhigh, captivahigh, vwhigh, citroenhigh, chryslerhigh, fiathigh,
     suzukihigh, arhigh, fordhigh, chevyhigh, hyundaihigh, mitsubishihigh,
     nissanhigh, opelhigh, pehigh, seathigh, volvohigh, smarthigh)
gc.collect()
0
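The brand-level ceilings above all follow one pattern: build a mask for a brand, then set `power` to 0 where it exceeds that brand's ceiling. A minimal sketch of a table-driven version follows. The helper name `zero_power_above_caps` and the toy frame are hypothetical, and only three of the ceilings are copied from the cells above. Because a zeroed value can never exceed a ceiling, the order in which the rules run does not matter.

```python
import pandas as pd

# Three of the brand ceilings from the cells above; extend as needed
BRAND_POWER_CAPS = {'citroen': 241, 'fiat': 220, 'suzuki': 290}

def zero_power_above_caps(df, caps):
    """Return a copy of df with power set to 0 where it exceeds the brand's ceiling."""
    df = df.copy()
    cap = df['brand'].map(caps)   # NaN for brands that have no ceiling in the table
    mask = df['power'] > cap      # comparisons against NaN are False, so those rows are skipped
    df.loc[mask, 'power'] = 0
    return df

# Toy data (illustrative only)
toy = pd.DataFrame({
    'brand': ['fiat', 'fiat', 'suzuki', 'bmw'],
    'power': [60, 999, 300, 500],
})
cleaned = zero_power_above_caps(toy, BRAND_POWER_CAPS)
print(cleaned['power'].tolist())
```

Model-specific ceilings (golf, polo, passat and so on) would need a second table keyed on `(brand, model)`; that is left out of this sketch.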
too_low = (df_app3['power']>0) & (df_app3['power']<5)
df_app3.loc[too_low,['power']] = 0
bmw = (df_app3['brand'] == 'bmw') & (df_app3['model'] == 'bmw')
df_app3.loc[bmw,['model']] = None
opellow = (df_app3['brand'] == 'opel') & (df_app3['power'] > 0) & (df_app3['power']<40) & (df_app3['model'] != 'other')
df_app3.loc[opellow,['power']] = 0
vwlow = (df_app3['brand'] == 'volkswagen') & (df_app3['power']>0) & (df_app3['power']<30) & (df_app3['model'] != 'other')
df_app3.loc[vwlow,['power']] = 0
citroenlow = (df_app3['brand'] == 'citroen') & (df_app3['power']>0) & (df_app3['power'] < 32) & (df_app3['model'] != 'other')
df_app3.loc[citroenlow,['power']] = 0
fordlow = (df_app3['brand'] == 'ford') & (df_app3['power']>0) & (df_app3['power'] < 45) & (df_app3['model'] != 'other')
df_app3.loc[fordlow,['power']] = 0
renaultlow = (df_app3['brand'] == 'renault') & (df_app3['power']>0) & (df_app3['power'] < 32) & (df_app3['model'] != 'other')
df_app3.loc[renaultlow,['power']] = 0
merclow = (df_app3['brand'] == 'mercedes_benz') & (df_app3['power']>0) & (df_app3['power'] < 55) & (df_app3['model'] != 'other')
df_app3.loc[merclow,['power']] = 0
bmwlow = (df_app3['brand'] == 'bmw') & (df_app3['power']>0) & (df_app3['power'] < 55) & (df_app3['model'] != 'other')
df_app3.loc[bmwlow,['power']] = 0
audilow = (df_app3['brand'] == 'audi') & (df_app3['power']>0) & (df_app3['power'] < 44) & (df_app3['model'] != 'other')
df_app3.loc[audilow,['power']] = 0
fiatlow = (df_app3['brand'] == 'fiat') & (df_app3['power']>0) & (df_app3['power'] < 13) & (df_app3['model'] != 'other')
df_app3.loc[fiatlow,['power']] = 0
pelow = (df_app3['brand'] == 'peugeot') & (df_app3['power']>0) & (df_app3['power'] < 34) & (df_app3['model'] != 'other')
df_app3.loc[pelow,['power']] = 0
trabantlow = (df_app3['brand'] == 'trabant') & (df_app3['power']>0) & (df_app3['power'] < 23) & (df_app3['model'] != 'other')
df_app3.loc[trabantlow,['power']] = 0
nislow = (df_app3['brand'] == 'nissan') & (df_app3['power']>0) & (df_app3['power'] < 45) & (df_app3['model'] != 'other')
df_app3.loc[nislow,['power']] = 0
sk45 = (df_app3['brand'].isin(['mazda','smart','seat','skoda','mitsubishi','toyota','volvo','honda','suzuki'])) & (df_app3['power']>0) & (df_app3['power'] < 45) & (df_app3['model'] != 'other')
df_app3.loc[sk45,['power']] = 0
hylow = (df_app3['brand'].isin(['hyundai'])) & (df_app3['power']>0) & (df_app3['power'] < 49) & (df_app3['model'] != 'other')
df_app3.loc[hylow,['power']] = 0
subarulow = (df_app3['brand'].isin(['subaru'])) & (df_app3['power']>0) & (df_app3['power'] < 54) & (df_app3['model'] != 'other')
df_app3.loc[subarulow,['power']] = 0
dacialow = (df_app3['brand'].isin(['dacia'])) & (df_app3['power']>0) & (df_app3['power'] < 67) & (df_app3['model'] != 'other')
df_app3.loc[dacialow,['power']] = 0
k55 = (df_app3['brand'].isin(['rover','kia','lancia'])) & (df_app3['power']>0) & (df_app3['power'] < 55) & (df_app3['model'] != 'other')
df_app3.loc[k55,['power']] = 0
lrlow = (df_app3['brand'].isin(['land_rover'])) & (df_app3['power']>0) & (df_app3['power'] < 50) & (df_app3['model'] != 'other')
df_app3.loc[lrlow,['power']] = 0
fiat500low = (df_app3['brand'] == 'fiat') & (df_app3['model'] == '500') & (df_app3['registrationyear'] > 1975) & (df_app3['power']>0) & (df_app3['power']< 69)
df_app3.loc[fiat500low,['power']] = 0
freelanderlow = (df_app3['brand'] == 'land_rover') & (df_app3['model'] == 'freelander') & (df_app3['power']>0) & (df_app3['power']< 109)
df_app3.loc[freelanderlow,['power']] = 0
pandalow = (df_app3['brand'] == 'fiat') & (df_app3['model'] == 'panda') & (df_app3['power'] > 0) & (df_app3['power']<30)
df_app3.loc[pandalow,['power']] = 0
seilow = (df_app3['brand'] == 'fiat') & (df_app3['model'] == 'seicento') & (df_app3['power'] > 0) & (df_app3['power']<39)
df_app3.loc[seilow,['power']] = 0
stilow = (df_app3['brand'] == 'fiat') & (df_app3['model'] == 'stilo') & (df_app3['power'] > 0) & (df_app3['power']<59)
df_app3.loc[stilow,['power']] = 0
beetle03 = (df_app3['brand'] == 'volkswagen') & (df_app3['model'] == 'beetle') & (df_app3['registrationyear'] >2002) & (df_app3['power'] > 0) & (df_app3['power']<75)
df_app3.loc[beetle03,['power']] = 0
polow = (df_app3['brand'] == 'volkswagen') & (df_app3['model'] == 'polo') & (df_app3['power']>0) & (df_app3['power'] < 37)
df_app3.loc[polow,['power']] = 0
luplow = (df_app3['model'] == 'lupo') & (df_app3['power']>0) & (df_app3['power'] < 45)
df_app3.loc[luplow,['power']] = 0
golflow = (df_app3['model'] == 'golf') & (df_app3['power']>0) & (df_app3['power'] < 50)
df_app3.loc[golflow,['power']] = 0
movlow = (df_app3['model'] == 'move') & (df_app3['power']>0) & (df_app3['power'] < 40)
df_app3.loc[movlow,['power']] = 0
sharanlow = (df_app3['model'] == 'sharan') & (df_app3['power']>0) & (df_app3['power'] < 90)
df_app3.loc[sharanlow,['power']] = 0
twinlow = (df_app3['model'] == 'twingo') & (df_app3['power']>0) & (df_app3['power'] < 40)
df_app3.loc[twinlow,['power']] = 0
del (too_low, bmw, opellow, vwlow, citroenlow, fordlow, renaultlow, merclow,
     bmwlow, audilow, fiatlow, pelow, trabantlow, nislow, sk45, hylow,
     subarulow, dacialow, k55, lrlow, fiat500low, freelanderlow, pandalow,
     seilow, stilow, beetle03, polow, luplow, golflow, movlow, sharanlow,
     twinlow)
gc.collect()
0
def fill_gearbox(df, threshold=0.9, verbose=True):
    df = df.copy()
    df['gearbox'] = df['gearbox'].str.lower().str.strip()
    fill_strategies = [
        ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'],
        ['brand', 'model', 'fueltype', 'vehicletype'],
        ['brand', 'model', 'fueltype'],
        ['brand', 'model'],
        ['brand']
    ]
    total_filled = 0
    start_missing = df['gearbox'].isna().sum()
    for cols in fill_strategies:
        # Count how many "auto" and "manual" in each group
        group_counts = (
            df.dropna(subset=['gearbox'])
            .groupby(cols)['gearbox']
            .value_counts(normalize=True)
            .rename('ratio')
            .reset_index()
        )
        # Keep only those where ratio >= threshold
        group_confident = (
            group_counts[group_counts['ratio'] >= threshold]
            .drop_duplicates(subset=cols)
            .rename(columns={'gearbox': 'fill_value'})
            .drop(columns=['ratio'])
        )
        if group_confident.empty:
            continue
        df = df.merge(group_confident, on=cols, how='left', suffixes=('', '_fill'))
        mask = df['gearbox'].isna() & df['fill_value'].notna()
        filled_now = mask.sum()
        df.loc[mask, 'gearbox'] = df.loc[mask, 'fill_value']
        df.drop(columns='fill_value', inplace=True)
        total_filled += filled_now
        if verbose and filled_now > 0:
            print(f"Filled {filled_now} missing gearbox values using {cols} (≥{threshold*100:.0f}% majority rule)")
        if df['gearbox'].isna().sum() == 0:
            break
    if verbose:
        end_missing = df['gearbox'].isna().sum()
        print(f"\n✅ Gearbox filling complete: {start_missing - end_missing} filled, {end_missing} still missing.")
    return df
df_app3g = fill_gearbox(df_app3)
Filled 7629 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (≥90% majority rule)
Filled 1128 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype'] (≥90% majority rule)
Filled 438 missing gearbox values using ['brand', 'model', 'fueltype'] (≥90% majority rule)
Filled 1612 missing gearbox values using ['brand', 'model'] (≥90% majority rule)
Filled 1678 missing gearbox values using ['brand'] (≥90% majority rule)

✅ Gearbox filling complete: 12485 filled, 7345 still missing.
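To make the majority rule inside `fill_gearbox` easier to follow, here is a small sketch of a single pass on toy data, grouped by `brand` only. The toy frame and the 0.8 threshold are assumptions chosen for illustration (the notebook runs with 0.9). A missing gearbox is filled only when one value accounts for at least the threshold share of the known values in its group.

```python
import numpy as np
import pandas as pd

# Toy data: brand 'a' is 90% manual among known values, brand 'b' is split 50/50
toy = pd.DataFrame({
    'brand':   ['a'] * 11 + ['b'] * 3,
    'gearbox': ['manual'] * 9 + ['auto', np.nan] + ['manual', 'auto', np.nan],
})
threshold = 0.8  # illustrative value, lower than the notebook's 0.9

# Share of each gearbox value within each brand, ignoring missing entries
shares = (
    toy.dropna(subset=['gearbox'])
       .groupby('brand')['gearbox']
       .value_counts(normalize=True)
       .rename('ratio')
       .reset_index()
)

# Keep only the groups with a clear majority, as a brand -> gearbox lookup
confident = shares[shares['ratio'] >= threshold].set_index('brand')['gearbox']

# Fill missing values from the lookup; brands without a clear majority stay NaN
filled = toy['gearbox'].fillna(toy['brand'].map(confident))
print(filled.isna().sum())
```

The missing value for brand `a` becomes `manual`, while the one for brand `b` stays missing because neither value reaches the threshold.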
df_app3g.to_pickle('checkpoint_01.pkl')
cvt = (df_app3g['model'].isin(['corsa'])) & (df_app3g['vehicletype'].isin(['wagon','coupe','bus']))
df_app3g.loc[cvt,'vehicletype'] = np.nan
gbus = (df_app3g['model'].isin(['golf'])) & (df_app3g['vehicletype'] == 'bus')
df_app3g.loc[gbus,['vehicletype']] = np.nan
puv = (df_app3g['model'].isin(['polo'])) & (df_app3g['vehicletype'].isin(['bus', 'suv']))
df_app3g.loc[puv,['vehicletype']] = np.nan
bmwsuv = (df_app3g['model'].isin(['3er'])) & (df_app3g['vehicletype'].isin(['bus', 'suv']))
df_app3g.loc[bmwsuv,['vehicletype']] = np.nan
astrabus = (df_app3g['model'].isin(['astra'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[astrabus,['vehicletype']] = np.nan
nosuv = (df_app3g['vehicletype'] == 'suv') & (df_app3g['model'].isin(['beetle','combo','transporter','vectra', 'verso','500','vito','a3','vivaro','a4','transit','m_reihe','astra','b_klasse','slk','corolla','corsa','doblo','r19','fabia','focus','picanto', 'omega', '147']))
df_app3g.loc[nosuv,['vehicletype']] = np.nan
noconvertible = (df_app3g['vehicletype'] == 'convertible') & (df_app3g['model'].isin(['ypsilon','100','passat','200','7er','90','a_klasse','antara','c2','calibra','forester','galaxy','glk','i3','kuga','nubira','zafira']))
df_app3g.loc[noconvertible,['vehicletype']] = np.nan
nocoupe = (df_app3g['vehicletype'] == 'coupe') & (df_app3g['model'].isin(['micra','aygo','9000','v70','a1','arosa','toledo','bora','ptcruiser','cx_reihe','seicento','getz','meriva','zafira']))
df_app3g.loc[nocoupe,['vehicletype']] = np.nan
nobus = (df_app3g['vehicletype'] == 'bus') & (df_app3g['model'].isin(['c5','civic','mondeo','astra','tucson','antara','a4','5er','4_reihe','x_trail','a6','sl','tigra','swift','micra','santa','forester','galant','justy','punto','panda','pajero','outlander','omega','m_klasse','mx_reihe','materia','lancer']))
df_app3g.loc[nobus,['vehicletype']] = np.nan
nowagon = (df_app3g['vehicletype'] == 'wagon') & (df_app3g['model'].isin(['jazz','calibra','200','getz','twingo','yeti','g_klasse','fox','arosa','clk','i3','musa','touareg','lanos','micra','a2','90','q3','lupo','santa','kappa','kalos','sl','niva','spark','slk']))
df_app3g.loc[nowagon,['vehicletype']] = np.nan
nosedan = (df_app3g['vehicletype'] == 'sedan') & (df_app3g['model'].isin(['v50','galaxy','z_reihe','s_max','materia','forester','tucson','move','cayenne','spider','sorento','cx_reihe','antara','rav','combo','cr_reihe']))
df_app3g.loc[nosedan,['vehicletype']] = np.nan
nosmall = (df_app3g['vehicletype'] == 'small') & (df_app3g['model'].isin(['doblo','verso','vivaro','6_reihe','defender','kuga','croma','m_reihe','grand','cayenne','rangerover','a6','sportage','accord','octavia','impreza','s_type','s_klasse','rx_reihe']))
df_app3g.loc[nosmall,['vehicletype']] = np.nan
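The blocks above each blank `vehicletype` for a list of models where that body type is treated as implausible. A hedged sketch of the same idea as a rule table follows. The helper `blank_implausible_types` and the toy frame are hypothetical, and the table holds only a few models copied from the `nosuv` and `noconvertible` lists above.

```python
import numpy as np
import pandas as pd

# Vehicle type -> models for which that type is treated as implausible (partial lists)
IMPLAUSIBLE_TYPES = {
    'suv': ['corsa', 'fabia'],
    'convertible': ['passat', 'zafira'],
}

def blank_implausible_types(df, rules):
    """Return a copy of df with vehicletype set to NaN wherever a rule matches."""
    df = df.copy()
    for vtype, models in rules.items():
        mask = (df['vehicletype'] == vtype) & df['model'].isin(models)
        df.loc[mask, 'vehicletype'] = np.nan
    return df

# Toy data (illustrative only)
toy = pd.DataFrame({
    'model':       ['corsa', 'corsa', 'passat', 'touareg'],
    'vehicletype': ['suv', 'small', 'convertible', 'suv'],
})
cleaned = blank_implausible_types(toy, IMPLAUSIBLE_TYPES)
print(cleaned['vehicletype'].tolist())
```

Rules that also depend on `brand` would need an extra key in the table; this sketch covers only the model-list form.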
noaudi = (df_app3g['model'] == 'audi')
df_app3g.loc[noaudi,['model']] = np.nan
notrab = (df_app3g['brand'] == 'trabant') & (df_app3g['model'] == '601') & (df_app3g['vehicletype'].isin(['coupe','suv']))
df_app3g.loc[notrab,['vehicletype']] = np.nan
nokiacoupe = (df_app3g['brand'].isin(['kia'])) & (df_app3g['vehicletype'].isin(['coupe']))
df_app3g.loc[nokiacoupe,['vehicletype']] = np.nan
daewc = (df_app3g['brand'].isin(['daewoo'])) & (df_app3g['model'] == 'lanos') & (df_app3g['vehicletype'].isin(['coupe','wagon']))
df_app3g.loc[daewc,['vehicletype']] = np.nan
lanc = (df_app3g['brand'] == 'lancia') & (df_app3g['model'].isin(['kappa','delta'])) & (df_app3g['vehicletype'] == 'coupe')
df_app3g.loc[lanc,['vehicletype']] = np.nan
alfa147 = (df_app3g['brand'] == 'alfa_romeo') & (df_app3g['model'] == '147') & ~(df_app3g['vehicletype'].isin(['small','other']))
df_app3g.loc[alfa147,['vehicletype']] = np.nan
rovernos = (df_app3g['brand'] == 'rover') & (df_app3g['model'].isin(['rangerover'])) & ~(df_app3g['vehicletype'].isin(['suv','other']))
df_app3g.loc[rovernos,['vehicletype']] = np.nan
ibizano = (df_app3g['brand'] == 'seat') & (df_app3g['model'].isin(['ibiza'])) & ~(df_app3g['vehicletype'].isin(['other','small','sedan']))
df_app3g.loc[ibizano,['vehicletype']] = np.nan
alteano = (df_app3g['brand'] == 'seat') & (df_app3g['model'].isin(['altea'])) & ~(df_app3g['vehicletype'].isin(['other','small']))
df_app3g.loc[alteano,['vehicletype']] = np.nan
focuscb = (df_app3g['brand'] == 'ford') & (df_app3g['model'] == 'focus') & (df_app3g['vehicletype'].isin(['coupe','bus']))
df_app3g.loc[focuscb,['vehicletype']] = np.nan
ccw = (df_app3g['brand'] == 'chrysler') & (df_app3g['model'] == 'crossfire') & (df_app3g['vehicletype'] == 'wagon')
df_app3g.loc[ccw,['vehicletype']] = np.nan
slcs = (df_app3g['brand'] == 'seat') & (df_app3g['model'] == 'leon') & (df_app3g['vehicletype'].isin(['coupe','sedan']))
df_app3g.loc[slcs,['vehicletype']] = np.nan
mcb = (df_app3g['brand'] == 'mazda') & (df_app3g['model'] == '3_reihe') & (df_app3g['vehicletype'].isin(['coupe','bus']))
df_app3g.loc[mcb,['vehicletype']] = np.nan
calc = (df_app3g['brand'] == 'opel') & (df_app3g['model'] == 'calibra') & (df_app3g['vehicletype'] != 'coupe')
df_app3g.loc[calc,['vehicletype']] = np.nan
hicsb = (df_app3g['brand'] == 'hyundai') & (df_app3g['model'] == 'i_reihe') & (df_app3g['vehicletype'].isin(['coupe','suv','bus']))
df_app3g.loc[hicsb,['vehicletype']] = np.nan
f500 = (df_app3g['brand'] == 'fiat') & (df_app3g['model'] == '500') & ~(df_app3g['vehicletype'].isin(['small','convertible']))
df_app3g.loc[f500,['vehicletype']] = np.nan
fpun = (df_app3g['brand'] == 'fiat') & (df_app3g['model'] == 'punto') & (df_app3g['vehicletype'].isin(['coupe','sedan']))
df_app3g.loc[fpun,['vehicletype']] = np.nan
daian = (df_app3g['brand'] == 'daihatsu') & (df_app3g['model'] == 'terios') & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[daian,['vehicletype']] = np.nan
ladan = (df_app3g['brand'] == 'lada') & (df_app3g['model'] == 'niva') & (df_app3g['vehicletype'].isin(['bus','sedan']))
df_app3g.loc[ladan,['vehicletype']] = np.nan
aq5 = (df_app3g['brand'] == 'audi') & (df_app3g['model'] == 'q5') & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[aq5,['vehicletype']] = np.nan
aq7 = (df_app3g['brand'] == 'audi') & (df_app3g['model'] == 'q7') & (df_app3g['vehicletype'].isin(['sedan','wagon']))
df_app3g.loc[aq7,['vehicletype']] = np.nan
dd = (df_app3g['brand'] == 'dacia') & (df_app3g['model'] == 'duster') & (df_app3g['vehicletype'].isin(['bus','wagon']))
df_app3g.loc[dd,['vehicletype']] = np.nan
tr = (df_app3g['brand'] == 'toyota') & (df_app3g['model'] == 'rav') & (df_app3g['vehicletype'].isin(['small','convertible']))
df_app3g.loc[tr,['vehicletype']] = np.nan
vxc = (df_app3g['brand'] == 'volvo') & (df_app3g['model'] == 'xc_reihe') & (df_app3g['vehicletype'].isin(['wagon','sedan']))
df_app3g.loc[vxc,['vehicletype']] = np.nan
sandan = (df_app3g['brand'] == 'dacia') & (df_app3g['model'] == 'sandero') & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[sandan,['vehicletype']] = np.nan
sju = (df_app3g['brand'] == 'subaru') & (df_app3g['model'] == 'justy') & (df_app3g['vehicletype'].isin(['suv','sedan','wagon']))
df_app3g.loc[sju,['vehicletype']] = np.nan
lym = (df_app3g['brand'] == 'lancia') & (df_app3g['model'].isin(['ypsilon','musa'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[lym,['vehicletype']] = np.nan
dmat = (df_app3g['brand'] == 'daewoo') & (df_app3g['model'].isin(['matiz'])) & (df_app3g['vehicletype'].isin(['sedan','wagon']))
df_app3g.loc[dmat,['vehicletype']] = np.nan
tay = (df_app3g['brand'] == 'toyota') & (df_app3g['model'].isin(['yaris'])) & (df_app3g['vehicletype'].isin(['bus','wagon']))
df_app3g.loc[tay,['vehicletype']] = np.nan
tayr = (df_app3g['brand'] == 'toyota') & (df_app3g['model'].isin(['aygo','auris'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[tayr,['vehicletype']] = np.nan
tus = (df_app3g['brand'] == 'toyota') & (df_app3g['model'].isin(['corolla'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[tus,['vehicletype']] = np.nan
coops = (df_app3g['brand'] == 'mini') & (df_app3g['model'].isin(['cooper'])) & (df_app3g['vehicletype'].isin(['suv','wagon','bus']))
df_app3g.loc[coops,['vehicletype']] = np.nan
coops = (df_app3g['brand'] == 'mini') & (df_app3g['vehicletype'] == 'bus')
df_app3g.loc[coops,['vehicletype']] = np.nan
mone = (df_app3g['brand'] == 'mini') & (df_app3g['model'].isin(['one'])) & (df_app3g['vehicletype'].isin(['suv','sedan']))
df_app3g.loc[mone,['vehicletype']] = np.nan
clubmn = (df_app3g['brand'] == 'mini') & (df_app3g['model'].isin(['clubman'])) & (df_app3g['vehicletype'].isin(['coupe','sedan']))
df_app3g.loc[clubmn,['vehicletype']] = np.nan
suzsw = (df_app3g['brand'] == 'suzuki') & (df_app3g['model'].isin(['swift'])) & (df_app3g['vehicletype'].isin(['wagon','sedan']))
df_app3g.loc[suzsw,['vehicletype']] = np.nan
cit12 = (df_app3g['brand'] == 'citroen') & (df_app3g['model'].isin(['c1','c2'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[cit12,['vehicletype']] = np.nan
cit4 = (df_app3g['brand'] == 'citroen') & (df_app3g['model'].isin(['c4'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[cit4,['vehicletype']] = np.nan
cit3 = (df_app3g['brand'] == 'citroen') & (df_app3g['model'].isin(['c3'])) & (df_app3g['vehicletype'].isin(['sedan','wagon','bus']))
df_app3g.loc[cit3,['vehicletype']] = np.nan
kr = (df_app3g['brand'] == 'kia') & (df_app3g['model'].isin(['rio'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[kr,['vehicletype']] = np.nan
cs = (df_app3g['brand'] == 'chevrolet') & (df_app3g['model'].isin(['spark'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[cs,['vehicletype']] = np.nan
p2 = (df_app3g['brand'] == 'peugeot') & (df_app3g['model'].isin(['2_reihe'])) & (df_app3g['vehicletype'].isin(['suv','bus']))
df_app3g.loc[p2,['vehicletype']] = np.nan
p1 = (df_app3g['brand'] == 'peugeot') & (df_app3g['model'].isin(['1_reihe'])) & (df_app3g['vehicletype'].isin(['sedan','wagon','convertible']))
df_app3g.loc[p1,['vehicletype']] = np.nan
p3 = (df_app3g['brand'] == 'peugeot') & (df_app3g['model'].isin(['3_reihe'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[p3,['vehicletype']] = np.nan
hg = (df_app3g['brand'] == 'hyundai') & (df_app3g['model'].isin(['getz'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[hg,['vehicletype']] = np.nan
oc = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['corsa'])) & (df_app3g['vehicletype'].isin(['coupe','convertible']))
df_app3g.loc[oc,['vehicletype']] = np.nan
oa = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['agila'])) & (df_app3g['vehicletype'].isin(['bus','wagon','sedan']))
df_app3g.loc[oa,['vehicletype']] = np.nan
omer = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['meriva'])) & (df_app3g['vehicletype'].isin(['bus','suv','sedan']))
df_app3g.loc[omer,['vehicletype']] = np.nan
ok = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['kadett'])) & (df_app3g['vehicletype'].isin(['suv']))
df_app3g.loc[ok,['vehicletype']] = np.nan
oz = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['zafira'])) & (df_app3g['vehicletype'].isin(['suv','sedan']))
df_app3g.loc[oz,['vehicletype']] = np.nan
hj = (df_app3g['brand'] == 'honda') & (df_app3g['model'].isin(['jazz'])) & (df_app3g['vehicletype'].isin(['bus','coupe','sedan']))
df_app3g.loc[hj,['vehicletype']] = np.nan
mak = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['a_klasse'])) & (df_app3g['vehicletype'].isin(['bus','suv','wagon','coupe']))
df_app3g.loc[mak,['vehicletype']] = np.nan
mbk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['b_klasse'])) & (df_app3g['vehicletype'].isin(['sedan','wagon']))
df_app3g.loc[mbk,['vehicletype']] = np.nan
mck = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['clk'])) & (df_app3g['vehicletype'].isin(['sedan','small','suv']))
df_app3g.loc[mck,['vehicletype']] = np.nan
msk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['sprinter'])) & (df_app3g['vehicletype'].isin(['sedan','small']))
df_app3g.loc[msk,['vehicletype']] = np.nan
mvk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['viano'])) & (df_app3g['vehicletype'].isin(['sedan','small']))
df_app3g.loc[mvk,['vehicletype']] = np.nan
mvtk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['vito'])) & (df_app3g['vehicletype'].isin(['small']))
df_app3g.loc[mvtk,['vehicletype']] = np.nan
nn = (df_app3g['brand'] == 'nissan') & (df_app3g['model'].isin(['note'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[nn,['vehicletype']] = np.nan
ff = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['fiesta'])) & (df_app3g['vehicletype'].isin(['bus','convertible']))
df_app3g.loc[ff,['vehicletype']] = np.nan
fk = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['ka'])) & (df_app3g['vehicletype'].isin(['coupe','wagon','convertible']))
df_app3g.loc[fk,['vehicletype']] = np.nan
ffu = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['fusion'])) & (df_app3g['vehicletype'].isin(['wagon','bus']))
df_app3g.loc[ffu,['vehicletype']] = np.nan
ffo = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['focus'])) & (df_app3g['vehicletype'].isin(['bus','suv','coupe','convertible']))
df_app3g.loc[ffo,['vehicletype']] = np.nan
fe = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['escort'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[fe,['vehicletype']] = np.nan
fm = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['mondeo'])) & (df_app3g['vehicletype'].isin(['small','coupe']))
df_app3g.loc[fm,['vehicletype']] = np.nan
sf2 = (df_app3g['brand'] == 'smart') & (df_app3g['model'].isin(['fortwo'])) & (df_app3g['vehicletype'].isin(['bus','sedan']))
df_app3g.loc[sf2,['vehicletype']] = np.nan
sf4 = (df_app3g['brand'] == 'smart') & (df_app3g['model'].isin(['forfour'])) & (df_app3g['vehicletype'].isin(['sedan','wagon','coupe','convertible']))
df_app3g.loc[sf4,['vehicletype']] = np.nan
sf4 = (df_app3g['brand'] == 'smart') & (df_app3g['vehicletype'] == 'bus')
df_app3g.loc[sf4,['vehicletype']] = np.nan
fsed = (df_app3g['brand'] == 'fiat') & (df_app3g['model'].isin(['panda','seicento'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[fsed,['vehicletype']] = np.nan
sleo = (df_app3g['brand'] == 'seat') & (df_app3g['model'].isin(['leon'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[sleo,['vehicletype']] = np.nan
sm = (df_app3g['brand'] == 'seat') & (df_app3g['model'].isin(['mii'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[sm,['vehicletype']] = np.nan
rc = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['clio'])) & (df_app3g['vehicletype'].isin(['coupe','bus']))
df_app3g.loc[rc,['vehicletype']] = np.nan
rt = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['twingo'])) & (df_app3g['vehicletype'].isin(['sedan','coupe','convertible']))
df_app3g.loc[rt,['vehicletype']] = np.nan
rm = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['modus'])) & (df_app3g['vehicletype'].isin(['sedan','bus','wagon']))
df_app3g.loc[rm,['vehicletype']] = np.nan
rme = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['megane'])) & (df_app3g['vehicletype'].isin(['suv','bus','small']))
df_app3g.loc[rme,['vehicletype']] = np.nan
rk = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['kangoo'])) & (df_app3g['vehicletype'].isin(['suv','sedan','small']))
df_app3g.loc[rk,['vehicletype']] = np.nan
skf = (df_app3g['brand'] == 'skoda') & (df_app3g['model'].isin(['fabia'])) & (df_app3g['vehicletype'].isin(['bus','convertible']))
df_app3g.loc[skf,['vehicletype']] = np.nan
skc = (df_app3g['brand'] == 'skoda') & (df_app3g['model'].isin(['citigo'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[skc,['vehicletype']] = np.nan
vwp = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['polo'])) & (df_app3g['vehicletype'].isin(['wagon','coupe','convertible']))
df_app3g.loc[vwp,['vehicletype']] = np.nan
vwu = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['up'])) & (df_app3g['vehicletype'].isin(['sedan','suv']))
df_app3g.loc[vwu,['vehicletype']] = np.nan
vwg = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['golf','passat'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[vwg,['vehicletype']] = np.nan
vwb = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['beetle'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[vwb,['vehicletype']] = np.nan
vwc = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['caddy'])) & (df_app3g['vehicletype'].isin(['small','suv','sedan','convertible']))
df_app3g.loc[vwc,['vehicletype']] = np.nan
vwf = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['fox'])) & (df_app3g['vehicletype'].isin(['coupe','convertible']))
df_app3g.loc[vwf,['vehicletype']] = np.nan
vwl = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['lupo'])) & (df_app3g['vehicletype'].isin(['coupe','convertible', 'bus','sedan']))
df_app3g.loc[vwl,['vehicletype']] = np.nan
vws = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['scirocco'])) & (df_app3g['vehicletype'].isin(['small','convertible','sedan']))
df_app3g.loc[vws,['vehicletype']] = np.nan
vwt = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['touran'])) & (df_app3g['vehicletype'].isin(['small','convertible','sedan','suv','wagon']))
df_app3g.loc[vwt,['vehicletype']] = np.nan
vwj = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['jetta'])) & (df_app3g['vehicletype'].isin(['coupe']))
df_app3g.loc[vwj,['vehicletype']] = np.nan
vwsh = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['sharan'])) & (df_app3g['vehicletype'].isin(['small','wagon','sedan','suv']))
df_app3g.loc[vwsh,['vehicletype']] = np.nan
vwtrans = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['transporter'])) & (df_app3g['vehicletype'].isin(['small','sedan','wagon']))
df_app3g.loc[vwtrans,['vehicletype']] = np.nan
bmwx = (df_app3g['brand'] == 'bmw') & (df_app3g['model'].isin(['x_reihe'])) & (df_app3g['vehicletype'].isin(['wagon','sedan','bus']))
df_app3g.loc[bmwx,['vehicletype']] = np.nan
b5 = (df_app3g['brand'] == 'bmw') & (df_app3g['model'].isin(['5er'])) & (df_app3g['vehicletype'].isin(['small','suv']))
df_app3g.loc[b5,['vehicletype']] = np.nan
b1 = (df_app3g['brand'] == 'bmw') & (df_app3g['model'].isin(['1er'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[b1,['vehicletype']] = np.nan
maz3 = (df_app3g['brand'] == 'mazda') & (df_app3g['model'].isin(['3_reihe'])) & (df_app3g['vehicletype'].isin(['wagon','coupe','convertible']))
df_app3g.loc[maz3,['vehicletype']] = np.nan
maz6 = (df_app3g['brand'] == 'mazda') & (df_app3g['model'].isin(['6_reihe'])) & (df_app3g['vehicletype'].isin(['coupe','convertible','bus','small']))
df_app3g.loc[maz6,['vehicletype']] = np.nan
mbck = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['c_klasse'])) & (df_app3g['vehicletype'].isin(['bus','small','other']))
df_app3g.loc[mbck,['vehicletype']] = np.nan
mbck = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'] == 'c_klasse') & (df_app3g['registrationyear'] == 2001) & (df_app3g['power'] == 122) & (df_app3g['fueltype'] == 'gasoline') & (df_app3g['mileage'] == 150000) & (df_app3g['price'] > 1799) & (df_app3g['price'] < 3501)
df_app3g.loc[mbck,['vehicletype']] = 'sedan'
mbek = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['e_klasse'])) & (df_app3g['vehicletype'].isin(['bus','small','suv']))
df_app3g.loc[mbek,['vehicletype']] = np.nan
mbsk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['s_klasse'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[mbsk,['vehicletype']] = np.nan
mbcs = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['cl','sl'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[mbcs,['vehicletype']] = np.nan
mbglk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['glk'])) & (df_app3g['vehicletype'].isin(['sedan','coupe']))
df_app3g.loc[mbglk,['vehicletype']] = np.nan
vwbor = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'].isin(['bora'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[vwbor,['vehicletype']] = np.nan
aa4 = (df_app3g['brand'] == 'audi') & (df_app3g['model'].isin(['a4'])) & (df_app3g['vehicletype'].isin(['coupe']))
df_app3g.loc[aa4,['vehicletype']] = np.nan
aa6 = (df_app3g['brand'] == 'audi') & (df_app3g['model'].isin(['a6'])) & (df_app3g['vehicletype'].isin(['suv']))
df_app3g.loc[aa6,['vehicletype']] = np.nan
aa8 = (df_app3g['brand'] == 'audi') & (df_app3g['model'].isin(['a8'])) & (df_app3g['vehicletype'].isin(['small','wagon']))
df_app3g.loc[aa8,['vehicletype']] = np.nan
aa5 = (df_app3g['brand'] == 'audi') & (df_app3g['model'].isin(['a5'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[aa5,['vehicletype']] = np.nan
aa1 = (df_app3g['brand'] == 'audi') & (df_app3g['model'].isin(['a1','q3'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[aa1,['vehicletype']] = np.nan
fc = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['c_max'])) & (df_app3g['vehicletype'].isin(['sedan','bus','suv']))
df_app3g.loc[fc,['vehicletype']] = np.nan
fm = (df_app3g['brand'] == 'ford') & (df_app3g['model'].isin(['mustang'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[fm,['vehicletype']] = np.nan
ov = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['vectra'])) & (df_app3g['vehicletype'].isin(['small','bus','convertible']))
df_app3g.loc[ov,['vehicletype']] = np.nan
os = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['signum'])) & (df_app3g['vehicletype'].isin(['sedan','bus']))
df_app3g.loc[os,['vehicletype']] = np.nan
omega = (df_app3g['brand'] == 'opel') & (df_app3g['model'].isin(['omega'])) & (df_app3g['vehicletype'].isin(['small']))
df_app3g.loc[omega,['vehicletype']] = np.nan
p5 = (df_app3g['brand'] == 'peugeot') & (df_app3g['model'].isin(['5_reihe'])) & (df_app3g['vehicletype'].isin(['coupe','small','convertible']))
df_app3g.loc[p5,['vehicletype']] = np.nan
p4 = (df_app3g['brand'] == 'peugeot') & (df_app3g['model'].isin(['4_reihe'])) & (df_app3g['vehicletype'].isin(['suv','small']))
df_app3g.loc[p4,['vehicletype']] = np.nan
rlag = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['laguna'])) & (df_app3g['vehicletype'].isin(['coupe','small','convertible']))
df_app3g.loc[rlag,['vehicletype']] = np.nan
rsc = (df_app3g['brand'] == 'renault') & (df_app3g['model'].isin(['scenic'])) & (df_app3g['vehicletype'].isin(['suv','sedan','bus']))
df_app3g.loc[rsc,['vehicletype']] = np.nan
ml = (df_app3g['brand'] == 'mitsubishi') & (df_app3g['model'].isin(['lancer'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[ml,['vehicletype']] = np.nan
mco = (df_app3g['brand'] == 'mitsubishi') & (df_app3g['model'].isin(['colt'])) & (df_app3g['vehicletype'].isin(['suv','wagon','bus','sedan']))
df_app3g.loc[mco,['vehicletype']] = np.nan
mout = (df_app3g['brand'] == 'mitsubishi') & (df_app3g['model'].isin(['outlander'])) & (df_app3g['vehicletype'].isin(['wagon','sedan']))
df_app3g.loc[mout,['vehicletype']] = np.nan
cc5 = (df_app3g['brand'] == 'citroen') & (df_app3g['model'].isin(['c5'])) & (df_app3g['vehicletype'].isin(['small','bus']))
df_app3g.loc[cc5,['vehicletype']] = np.nan
st = (df_app3g['brand'] == 'seat') & (df_app3g['model'].isin(['toledo'])) & (df_app3g['vehicletype'].isin(['small','bus']))
df_app3g.loc[st,['vehicletype']] = np.nan
tv = (df_app3g['brand'] == 'toyota') & (df_app3g['model'].isin(['verso'])) & (df_app3g['vehicletype'].isin(['sedan','bus']))
df_app3g.loc[tv,['vehicletype']] = np.nan
ta = (df_app3g['brand'] == 'toyota') & (df_app3g['model'].isin(['avensis'])) & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[ta,['vehicletype']] = np.nan
vv40 = (df_app3g['brand'] == 'volvo') & (df_app3g['model'].isin(['v40'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[vv40,['vehicletype']] = np.nan
vcr = (df_app3g['brand'] == 'volvo') & (df_app3g['model'].isin(['c_reihe'])) & (df_app3g['vehicletype'].isin(['sedan','wagon']))
df_app3g.loc[vcr,['vehicletype']] = np.nan
fbrav = (df_app3g['brand'] == 'fiat') & (df_app3g['model'].isin(['bravo'])) & (df_app3g['vehicletype'].isin(['sedan','wagon','coupe']))
df_app3g.loc[fbrav,['vehicletype']] = np.nan
c300 = (df_app3g['brand'] == 'chrysler') & (df_app3g['model'].isin(['300c'])) & (df_app3g['vehicletype'].isin(['wagon']))
df_app3g.loc[c300,['vehicletype']] = np.nan
dand = (df_app3g['brand'] == 'dacia') & (df_app3g['model'].isin(['logan'])) & (df_app3g['vehicletype'].isin(['suv']))
df_app3g.loc[dand,['vehicletype']] = np.nan
land = (df_app3g['brand'] == 'lancia') & (df_app3g['model'].isin(['delta']))
df_app3g.loc[land,['vehicletype']] = 'other'
rdef = (df_app3g['brand'] == 'rover') & (df_app3g['model'].isin(['defender']))
df_app3g.loc[rdef,['vehicletype']] = 'suv'
jb = (df_app3g['brand'] == 'jeep') & (df_app3g['vehicletype'].isin(['bus']))
df_app3g.loc[jb,['vehicletype']] = np.nan
rdisc = (df_app3g['brand'] == 'rover') & (df_app3g['model'].isin(['discovery'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[rdisc,['vehicletype']] = np.nan
norover = (df_app3g['brand'] == 'rover') & (df_app3g['model'].isin(['defender','freelander','discovery','rangerover']))
df_app3g.loc[norover,['brand']] = 'land_rover'
lrfree = (df_app3g['brand'] == 'land_rover') & (df_app3g['model'].isin(['freelander'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[lrfree,['vehicletype']] = np.nan
nq = (df_app3g['brand'] == 'nissan') & (df_app3g['model'].isin(['qashqai'])) & (df_app3g['vehicletype'].isin(['sedan','bus','wagon']))
df_app3g.loc[nq,['vehicletype']] = np.nan
nq = (df_app3g['brand'] == 'nissan') & (df_app3g['model'].isin(['qashqai'])) & (df_app3g['vehicletype'].isna())
df_app3g.loc[nq,['vehicletype']] = 'suv'
nnav = (df_app3g['brand'] == 'nissan') & (df_app3g['model'].isin(['navara'])) & (df_app3g['vehicletype'].isin(['sedan']))
df_app3g.loc[nnav,['vehicletype']] = np.nan
nnav = (df_app3g['brand'] == 'nissan') & (df_app3g['model'].isin(['navara'])) & (df_app3g['vehicletype'].isna())
df_app3g.loc[nnav,['vehicletype']] = 'suv'
hcr = (df_app3g['brand'] == 'honda') & (df_app3g['model'].isin(['cr_reihe'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[hcr,['vehicletype']] = np.nan
mcon = (df_app3g['brand'] == 'mazda') & (df_app3g['model'].isin(['5_reihe','cx_reihe','1_reihe'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[mcon,['vehicletype']] = np.nan
maz5 = (df_app3g['brand'] == 'mazda') & (df_app3g['model'].isin(['5_reihe'])) & (df_app3g['vehicletype'].isin(['suv','wagon','sedan']))
df_app3g.loc[maz5,['vehicletype']] = np.nan
cit3 = (df_app3g['brand'] == 'citroen') & (df_app3g['model'].isin(['c3'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[cit3,['vehicletype']] = np.nan
fvert = (df_app3g['brand'] == 'fiat') & (df_app3g['model'].isin(['punto','panda'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[fvert,['vehicletype']] = np.nan
zuvert = (df_app3g['brand'] == 'suzuki') & (df_app3g['model'].isin(['swift','grand'])) & (df_app3g['vehicletype'].isin(['convertible']))
df_app3g.loc[zuvert,['vehicletype']] = np.nan
mgk = (df_app3g['brand'] == 'mercedes_benz') & (df_app3g['model'].isin(['g_klasse'])) & (df_app3g['vehicletype'].isin(['convertible','sedan']))
df_app3g.loc[mgk,['vehicletype']] = np.nan
arsp = (df_app3g['brand'] == 'alfa_romeo') & (df_app3g['model'].isin(['spider'])) & (df_app3g['vehicletype'].isin(['coupe']))
df_app3g.loc[arsp,['vehicletype']] = np.nan
toua = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'] == 'tiguan') & (df_app3g['vehicletype'].isin(['sedan','bus']))
df_app3g.loc[toua,['vehicletype']] = np.nan
toua = (df_app3g['brand'] == 'volkswagen') & (df_app3g['model'] == 'tiguan') & (df_app3g['vehicletype'].isna())
df_app3g.loc[toua,['vehicletype']] = 'suv'
ptc = (df_app3g['brand'] == 'chrysler') & (df_app3g['model'] == 'ptcruiser') & (df_app3g['vehicletype'] == 'bus')
df_app3g.loc[ptc,['vehicletype']] = np.nan
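The mask-and-assign pairs above all follow one pattern: pick a brand, a set of models, and a set of vehicle types considered implausible for them, then blank the type. A minimal sketch of the same idea as a rule table is shown below. The names `IMPLAUSIBLE_TYPES` and `blank_implausible_types` are hypothetical and not part of the notebook; the three rules are copied from the cells above, and the rest would be added the same way.

```python
import numpy as np
import pandas as pd

# Hypothetical rule table: (brand, models, implausible vehicle types).
# These three entries mirror rules from the cells above.
IMPLAUSIBLE_TYPES = [
    ("volkswagen", ["jetta"], ["coupe"]),
    ("bmw", ["5er"], ["small", "suv"]),
    ("audi", ["a1", "q3"], ["sedan"]),
]

def blank_implausible_types(df, rules):
    """Set vehicletype to NaN wherever a (brand, model, type) combination matches a rule."""
    out = df.copy()
    for brand, models, bad_types in rules:
        mask = (
            (out["brand"] == brand)
            & out["model"].isin(models)
            & out["vehicletype"].isin(bad_types)
        )
        out.loc[mask, "vehicletype"] = np.nan
    return out

# Tiny illustrative frame (made-up rows, not taken from the dataset).
demo = pd.DataFrame({
    "brand": ["volkswagen", "bmw", "bmw", "audi"],
    "model": ["jetta", "5er", "5er", "q3"],
    "vehicletype": ["coupe", "sedan", "suv", "sedan"],
})
cleaned = blank_implausible_types(demo, IMPLAUSIBLE_TYPES)
print(cleaned)
```

Because each mask lives only inside the loop, this form would not leave dozens of boolean Series in the notebook namespace that need to be deleted afterwards.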
print(df_app3g.memory_usage(deep=True).sum() / 1_000_000, "MB")
248.205326 MB
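Most of the memory reported above is likely held by the string columns, since `memory_usage(deep=True)` counts the full size of every Python string. One common way to shrink such columns is the pandas `category` dtype. The sketch below is an illustration on made-up data, not a step the notebook takes; `to_category` is a hypothetical helper.

```python
import pandas as pd

def to_category(df, columns):
    """Return a copy with the given columns cast to the pandas category dtype."""
    out = df.copy()
    for col in columns:
        out[col] = out[col].astype("category")
    return out

# Made-up frame with repetitive string columns, similar in spirit to brand/gearbox.
demo = pd.DataFrame({
    "brand": ["volkswagen", "bmw", "audi", "opel"] * 2500,
    "gearbox": ["manual", "auto"] * 5000,
})
before_mb = demo.memory_usage(deep=True).sum() / 1_000_000
compact = to_category(demo, ["brand", "gearbox"])
after_mb = compact.memory_usage(deep=True).sum() / 1_000_000
print(f"{before_mb:.3f} MB -> {after_mb:.3f} MB")
```

One caveat if this were applied here: assigning a value that is not already one of the column's categories raises an error, so rule-based corrections such as setting `vehicletype` to `'suv'` would need that category to exist first.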
gc.collect()
0
del toua
del jb
del rdisc
del norover
del lrfree
del nq
del nnav
del hcr
del mcon
del maz5
del fvert
del zuvert
del mgk
del arsp
del bmwx
del b5
del b1
del maz3
del maz6
del mbck
del mbek
del mbsk
del mbcs
del mbglk
del vwbor
del aa4
del aa6
del aa8
del aa5
del aa1
del fc
del fm
del ov
del os
del omega
del p5
del p4
del rlag
del rsc
del ml
del mco
del mout
del cc5
del st
del tv
del ta
del vv40
del vcr
del fbrav
del c300
del dand
del land
del rdef
del fsed
del sleo
del sm
del rc
del rt
del rm
del rme
del rk
del skf
del skc
del vwp
del vwu
del vwg
del vwb
del vwc
del vwf
del vwl
del vws
del vwt
del vwj
del vwsh
del vwtrans
del kr
del cs
del p2
del p1
del p3
del hg
del oc
del oa
del omer
del ok
del oz
del hj
del mak
del mbk
del mck
del msk
del mvk
del mvtk
del nn
del ff
del fk
del ffu
del ffo
del fe
del sf2
del sf4
del cit4
del cit3
del cit12
del suzsw
del clubmn
del mone
del coops
del tus
del tayr
del tay
del dmat
del lym
del sju
del sandan
del vxc
del tr
del dd
del aq7
del aq5
del ladan
del daian
del fpun
del f500
del hicsb
del calc
del mcb
del slcs
del ccw
del focuscb
del alteano
del ibizano
del rovernos
del alfa147
del lanc
del daewc
del nokiacoupe
del notrab
gc.collect()
0
def fill_all_missing_values(df, repeat_until_change=True, threshold=0.7, verbose=True, max_iterations=10):
    """
    Fill missing values for power, vehicletype, model, fueltype using tiered group strategies.
    Optimized version with better memory management and early stopping.
    """
    df = df.copy()

    def safe_mode(series):
        """Return mode if confident enough (>= threshold), else NaN."""
        s = series.dropna()
        if len(s) == 0:
            return np.nan
        counts = s.value_counts(normalize=True)
        if len(counts) == 0:
            return np.nan
        top_val, top_freq = counts.index[0], counts.iloc[0]
        return top_val if top_freq >= threshold else np.nan

    def is_zero_condition(condition):
        """Heuristic to detect if condition checks for zero (lambda x: x == 0)."""
        try:
            test = condition(pd.Series([0, np.nan], dtype=object))
            if isinstance(test, (bool, np.bool_)) and test:
                return True
            if hasattr(test, "__len__") and len(test) >= 1:
                return bool(test.iloc[0] if hasattr(test, 'iloc') else test[0])
        except Exception:
            pass
        return False

    def make_key_tuple(row_vals):
        """Helper: convert list-like row values to a hashable tuple with None for NaN."""
        return tuple(None if (isinstance(x, float) and np.isnan(x)) else x for x in row_vals)

    def fill_column(target_col, fill_strategies, condition=lambda x: x.isna()):
        total_filled = 0
        zero_check = is_zero_condition(condition)
        # Track initial state
        if zero_check:
            initial_missing = (df[target_col] == 0).sum()
        else:
            initial_missing = df[target_col].isna().sum()
        if initial_missing == 0:
            return 0
        if verbose:
            print(f" → Starting with {initial_missing:,} missing values in '{target_col}'")
        for cols in fill_strategies:
            # Check if there's still work to do
            if zero_check:
                current_missing = (df[target_col] == 0).sum()
            else:
                current_missing = df[target_col].isna().sum()
            if current_missing == 0:
                break
            start_time = time.time()
            try:
                # Compute group modes using safe_mode
                group_modes = (
                    df.groupby(cols, dropna=False)[target_col]
                    .apply(safe_mode)
                    .reset_index()
                    .rename(columns={target_col: 'fill_value'})
                )
                # Remove groups with no valid fill value
                group_modes = group_modes[group_modes['fill_value'].notna()]
                if len(group_modes) == 0:
                    continue
            except Exception as e:
                if verbose:
                    print(f"⚠️ Skipping strategy {cols} — groupby failed: {str(e)[:50]}")
                continue
            # Build mapping dict from group_modes
            keys = group_modes[cols].apply(lambda r: make_key_tuple(r.tolist()), axis=1)
            mapping = dict(zip(keys, group_modes['fill_value'].values))
            # Compute fill_value per-row by mapping (keeps original row order)
            row_keys = df[cols].apply(lambda r: make_key_tuple(r.tolist()), axis=1)
            fill_series = row_keys.map(mapping)
            # Create mask of rows that need filling AND have a candidate fill_value
            mask_need = condition(df[target_col])
            mask_candidate = fill_series.notna()
            mask = mask_need & mask_candidate
            # Count before
            if zero_check:
                before_missing = (df[target_col] == 0).sum()
            else:
                before_missing = df[target_col].isna().sum()
            # Perform fill
            if mask.any():
                df.loc[mask, target_col] = fill_series.loc[mask].values
            # Count after
            if zero_check:
                after_missing = (df[target_col] == 0).sum()
            else:
                after_missing = df[target_col].isna().sum()
            filled_now = before_missing - after_missing
            total_filled += int(filled_now)
            if verbose and filled_now > 0:
                elapsed = time.time() - start_time
                print(f"✅ Filled {int(filled_now):,} values in '{target_col}' using {cols} ({after_missing:,} remaining, took {elapsed:.2f}s)")
        return total_filled

    iteration = 0
    while iteration < max_iterations:
        iteration += 1
        total_filled = 0
        if verbose:
            print(f"\n🌀 Iteration {iteration} starting...")
        # --- POWER ---
        power_strategies = [
            ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'vehicletype'],
            ['brand', 'model', 'fueltype', 'vehicletype'],
            ['brand', 'model', 'fueltype', 'year_bin'],
            ['brand', 'model', 'fueltype', 'registrationyear'],
            ['brand', 'model', 'fueltype', 'gearbox'],
            ['brand', 'model', 'year_bin'],
            ['brand', 'model', 'registrationyear'],
            ['brand', 'model'],
            ['brand', 'vehicletype'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'gearbox'],
            ['brand']
        ]
        total_filled += fill_column('power', power_strategies, condition=lambda x: x == 0)
        # --- VEHICLE TYPE ---
        vehicletype_strategies = [
            ['brand', 'model', 'power', 'year_bin'],
            ['brand', 'model', 'power', 'registrationyear'],
            ['brand', 'model', 'power', 'gearbox'],
            ['brand', 'model', 'year_bin'],
            ['brand', 'model', 'registrationyear'],
            ['brand', 'model', 'power'],
            ['brand', 'model', 'gearbox'],
            ['brand', 'model'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'power'],
            ['brand', 'gearbox'],
            ['brand']
        ]
        total_filled += fill_column('vehicletype', vehicletype_strategies)
        # --- MODEL ---
        model_strategies = [
            ['brand', 'vehicletype', 'power', 'year_bin'],
            ['brand', 'vehicletype', 'power', 'registrationyear'],
            ['brand', 'vehicletype', 'power', 'gearbox'],
            ['brand', 'vehicletype', 'year_bin'],
            ['brand', 'vehicletype', 'registrationyear'],
            ['brand', 'vehicletype', 'power'],
            ['brand', 'vehicletype', 'gearbox'],
            ['brand', 'vehicletype'],
            ['brand', 'power'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'gearbox'],
            ['brand']
        ]
        total_filled += fill_column('model', model_strategies)
        # --- FUELTYPE ---
        fueltype_strategies = [
            ['brand', 'model', 'vehicletype', 'power', 'year_bin'],
            ['brand', 'model', 'vehicletype', 'power', 'registrationyear'],
            ['brand', 'model', 'vehicletype', 'power', 'gearbox'],
            ['brand', 'model', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'power', 'year_bin'],
            ['brand', 'model', 'power', 'registrationyear'],
            ['brand', 'model', 'power', 'gearbox'],
            ['brand', 'model', 'power'],
            ['brand', 'model'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'gearbox'],
            ['brand']
        ]
        total_filled += fill_column('fueltype', fueltype_strategies)
        if verbose:
            print(f"🔁 Iteration {iteration} filled {total_filled:,} total values")
        if not repeat_until_change or total_filled == 0:
            if verbose:
                print("🏁 No further changes detected, stopping.")
            break
    return df
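The core rule in `fill_all_missing_values` is: within a group, use the most frequent known value only if it accounts for at least `threshold` of the known values. The sketch below shows one possible vectorized form of that rule using `groupby(...).size()` and a left merge instead of row-wise tuple keys. It is an illustration under stated assumptions, not the notebook's implementation: `confident_group_mode` and `fill_from_groups` are hypothetical names, and unlike the function above this sketch skips rows whose group keys contain NaN.

```python
import numpy as np
import pandas as pd

def confident_group_mode(df, group_cols, target_col, threshold=0.7):
    """Per group, return the most frequent known target value if its share is >= threshold."""
    known = df.dropna(subset=[target_col])
    counts = (
        known.groupby(group_cols + [target_col])
        .size()
        .reset_index(name="n")
    )
    # Share of each value among the known values of its group.
    counts["share"] = counts["n"] / counts.groupby(group_cols)["n"].transform("sum")
    top = counts.sort_values("share", ascending=False).drop_duplicates(subset=group_cols)
    top = top[top["share"] >= threshold]
    return top[group_cols + [target_col]].rename(columns={target_col: "fill_value"})

def fill_from_groups(df, group_cols, target_col, threshold=0.7, is_missing=pd.isna):
    """Fill rows flagged by is_missing with their group's confident mode, if one exists."""
    out = df.copy()
    modes = confident_group_mode(out, group_cols, target_col, threshold)
    # A left merge keeps the row order of `out`; restore its index for alignment.
    candidate = out[group_cols].merge(modes, on=group_cols, how="left")["fill_value"]
    candidate.index = out.index
    mask = is_missing(out[target_col]) & candidate.notna()
    out.loc[mask, target_col] = candidate[mask]
    return out

# Made-up rows: golf is 100% sedan among known values, 3er only 2 out of 3.
demo = pd.DataFrame({
    "brand": ["volkswagen"] * 5 + ["bmw"] * 4,
    "model": ["golf"] * 5 + ["3er"] * 4,
    "vehicletype": ["sedan", "sedan", "sedan", "sedan", np.nan,
                    "sedan", "wagon", np.nan, "sedan"],
})
filled = fill_from_groups(demo, ["brand", "model"], "vehicletype", threshold=0.7)
print(filled)
```

With a threshold above 0.5 at most one value per group can qualify, so ties in the sort do not affect the result. Note also that in the function above, the `power` column marks missing values with 0 rather than NaN, so zeros are counted as ordinary values when the group mode is computed; replacing 0 with NaN before computing the modes would be one way to avoid that, at the cost of changing the results shown in the log.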
df_app3g = df_app3g[df_app3g['price'] > 99].copy()
gc.collect()
0
df_app3g = fill_gearbox(df_app3g)
Filled 19 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (≥90% majority rule)
✅ Gearbox filling complete: 19 filled, 5525 still missing.
df_app = fill_all_missing_values(df_app3g, threshold = 0.75)
🌀 Iteration 1 starting...
→ Starting with 31,620 missing values in 'power'
✅ Filled 101 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (31,519 remaining, took 5.41s)
✅ Filled 91 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (31,428 remaining, took 12.76s)
✅ Filled 72 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (31,356 remaining, took 4.95s)
✅ Filled 145 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (31,211 remaining, took 3.74s)
✅ Filled 101 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (31,110 remaining, took 8.65s)
✅ Filled 62 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (31,048 remaining, took 3.55s)
✅ Filled 45 values in 'power' using ['brand', 'model', 'vehicletype'] (31,003 remaining, took 2.90s)
✅ Filled 11 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype'] (30,992 remaining, took 4.14s)
✅ Filled 47 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (30,945 remaining, took 3.78s)
✅ Filled 93 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (30,852 remaining, took 8.22s)
✅ Filled 26 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (30,826 remaining, took 3.49s)
✅ Filled 23 values in 'power' using ['brand', 'model', 'year_bin'] (30,803 remaining, took 2.72s)
✅ Filled 116 values in 'power' using ['brand', 'model', 'registrationyear'] (30,687 remaining, took 5.28s)
✅ Filled 22 values in 'power' using ['brand', 'model'] (30,665 remaining, took 2.37s)
✅ Filled 16 values in 'power' using ['brand', 'vehicletype'] (30,649 remaining, took 2.45s)
✅ Filled 19 values in 'power' using ['brand', 'year_bin'] (30,630 remaining, took 2.28s)
✅ Filled 1 values in 'power' using ['brand', 'registrationyear'] (30,629 remaining, took 3.41s)
✅ Filled 14 values in 'power' using ['brand', 'gearbox'] (30,615 remaining, took 2.33s)
→ Starting with 26,591 missing values in 'vehicletype'
✅ Filled 10,146 values in 'vehicletype' using ['brand', 'model', 'power', 'year_bin'] (16,445 remaining, took 10.68s)
✅ Filled 2,283 values in 'vehicletype' using ['brand', 'model', 'power', 'registrationyear'] (14,162 remaining, took 21.27s)
✅ Filled 3,820 values in 'vehicletype' using ['brand', 'model', 'power', 'gearbox'] (10,342 remaining, took 9.68s)
✅ Filled 2,228 values in 'vehicletype' using ['brand', 'model', 'year_bin'] (8,114 remaining, took 2.87s)
✅ Filled 1,685 values in 'vehicletype' using ['brand', 'model', 'registrationyear'] (6,429 remaining, took 5.63s)
✅ Filled 282 values in 'vehicletype' using ['brand', 'model', 'power'] (6,147 remaining, took 8.02s)
✅ Filled 517 values in 'vehicletype' using ['brand', 'model', 'gearbox'] (5,630 remaining, took 2.80s)
✅ Filled 72 values in 'vehicletype' using ['brand', 'model'] (5,558 remaining, took 2.56s)
✅ Filled 5 values in 'vehicletype' using ['brand', 'year_bin'] (5,553 remaining, took 2.35s)
✅ Filled 138 values in 'vehicletype' using ['brand', 'registrationyear'] (5,415 remaining, took 3.59s)
✅ Filled 454 values in 'vehicletype' using ['brand', 'power'] (4,961 remaining, took 4.88s)
✅ Filled 6 values in 'vehicletype' using ['brand', 'gearbox'] (4,955 remaining, took 2.41s)
→ Starting with 11,269 missing values in 'model'
✅ Filled 2,651 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (8,618 remaining, took 11.07s)
✅ Filled 1,606 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (7,012 remaining, took 22.50s)
✅ Filled 765 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (6,247 remaining, took 9.64s)
✅ Filled 248 values in 'model' using ['brand', 'vehicletype', 'year_bin'] (5,999 remaining, took 2.83s)
✅ Filled 342 values in 'model' using ['brand', 'vehicletype', 'registrationyear'] (5,657 remaining, took 5.69s)
✅ Filled 116 values in 'model' using ['brand', 'vehicletype', 'power'] (5,541 remaining, took 7.97s)
✅ Filled 56 values in 'model' using ['brand', 'vehicletype', 'gearbox'] (5,485 remaining, took 2.77s)
✅ Filled 92 values in 'model' using ['brand', 'power'] (5,393 remaining, took 4.72s)
✅ Filled 6 values in 'model' using ['brand', 'year_bin'] (5,387 remaining, took 2.40s)
✅ Filled 38 values in 'model' using ['brand', 'registrationyear'] (5,349 remaining, took 3.58s)
→ Starting with 11,647 missing values in 'fueltype'
✅ Filled 6,346 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'year_bin'] (5,301 remaining, took 14.24s)
✅ Filled 998 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'registrationyear'] (4,303 remaining, took 27.06s)
✅ Filled 1,418 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'gearbox'] (2,885 remaining, took 13.31s)
✅ Filled 1,299 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'year_bin'] (1,586 remaining, took 3.76s)
✅ Filled 288 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'registrationyear'] (1,298 remaining, took 8.66s)
✅ Filled 122 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'gearbox'] (1,176 remaining, took 3.55s)
✅ Filled 150 values in 'fueltype' using ['brand', 'model', 'power', 'year_bin'] (1,026 remaining, took 10.84s)
✅ Filled 134 values in 'fueltype' using ['brand', 'model', 'power', 'registrationyear'] (892 remaining, took 21.79s)
✅ Filled 105 values in 'fueltype' using ['brand', 'model', 'power', 'gearbox'] (787 remaining, took 9.74s)
✅ Filled 75 values in 'fueltype' using ['brand', 'model', 'power'] (712 remaining, took 7.89s)
✅ Filled 319 values in 'fueltype' using ['brand', 'model'] (393 remaining, took 2.46s)
✅ Filled 19 values in 'fueltype' using ['brand', 'year_bin'] (374 remaining, took 2.36s)
✅ Filled 26 values in 'fueltype' using ['brand', 'registrationyear'] (348 remaining, took 3.63s)
✅ Filled 9 values in 'fueltype' using ['brand', 'gearbox'] (339 remaining, took 2.43s)
🔁 Iteration 1 filled 39,869 total values

🌀 Iteration 2 starting...
→ Starting with 30,615 missing values in 'power'
✅ Filled 26 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (30,589 remaining, took 4.61s)
✅ Filled 63 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (30,526 remaining, took 10.83s)
✅ Filled 10 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (30,516 remaining, took 4.34s)
✅ Filled 205 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (30,311 remaining, took 3.61s)
✅ Filled 11 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (30,300 remaining, took 7.97s)
✅ Filled 765 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (29,535 remaining, took 3.34s)
✅ Filled 405 values in 'power' using ['brand', 'model', 'vehicletype'] (29,130 remaining, took 2.83s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype'] (29,129 remaining, took 3.48s)
✅ Filled 20 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (29,109 remaining, took 3.37s)
✅ Filled 14 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (29,095 remaining, took 7.17s)
✅ Filled 38 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (29,057 remaining, took 3.18s)
✅ Filled 247 values in 'power' using ['brand', 'model', 'year_bin'] (28,810 remaining, took 2.68s)
✅ Filled 114 values in 'power' using ['brand', 'model', 'registrationyear'] (28,696 remaining, took 5.20s)
✅ Filled 423 values in 'power' using ['brand', 'model'] (28,273 remaining, took 2.38s)
✅ Filled 1 values in 'power' using ['brand', 'vehicletype'] (28,272 remaining, took 2.41s)
→ Starting with 4,955 missing values in 'vehicletype'
✅ Filled 590 values in 'vehicletype' using ['brand', 'model', 'power', 'year_bin'] (4,365 remaining, took 10.93s)
✅ Filled 277 values in 'vehicletype' using ['brand', 'model', 'power', 'registrationyear'] (4,088 remaining, took 21.86s)
✅ Filled 169 values in 'vehicletype' using ['brand', 'model', 'power', 'gearbox'] (3,919 remaining, took 9.64s)
✅ Filled 296 values in 'vehicletype' using ['brand', 'model', 'year_bin'] (3,623 remaining, took 2.79s)
✅ Filled 64 values in 'vehicletype' using ['brand', 'model', 'registrationyear'] (3,559 remaining, took 5.61s)
✅ Filled 38 values in 'vehicletype' using ['brand', 'model', 'power'] (3,521 remaining, took 8.03s)
✅ Filled 224 values in 'vehicletype' using ['brand', 'model', 'gearbox'] (3,297 remaining, took 2.76s)
✅ Filled 17 values in 'vehicletype' using ['brand', 'model'] (3,280 remaining, took 2.46s)
✅ Filled 1 values in 'vehicletype' using ['brand', 'power'] (3,279 remaining, took 4.84s)
→ Starting with 5,349 missing values in 'model'
✅ Filled 128 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (5,221 remaining, took 11.08s)
✅ Filled 28 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (5,193 remaining, took 22.45s)
✅ Filled 20 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (5,173 remaining, took 9.89s)
✅ Filled 1 values in 'model' using ['brand', 'vehicletype', 'registrationyear'] (5,172 remaining, took 5.76s)
✅ Filled 3 values in 'model' using ['brand', 'vehicletype', 'power'] (5,169 remaining, took 7.97s)
✅ Filled 1 values in 'model' using ['brand', 'vehicletype', 'gearbox'] (5,168 remaining, took 2.68s)
✅ Filled 2 values in 'model' using ['brand', 'vehicletype'] (5,166 remaining, took 2.47s)
✅ Filled 1 values in 'model' using ['brand', 'power'] (5,165 remaining, took 4.88s)
→ Starting with 339 missing values in 'fueltype'
✅ Filled 10 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'gearbox'] (329 remaining, took 13.54s)
✅ Filled 1 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'registrationyear'] (328 remaining, took 8.69s)
🔁 Iteration 2 filled 4,214 total values

🌀 Iteration 3 starting...
→ Starting with 28,272 missing values in 'power'
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (28,271 remaining, took 4.52s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (28,270 remaining, took 10.85s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (28,269 remaining, took 4.19s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (28,268 remaining, took 3.50s)
✅ Filled 62 values in 'power' using ['brand', 'model', 'vehicletype'] (28,206 remaining, took 2.80s)
✅ Filled 11 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (28,195 remaining, took 3.28s)
✅ Filled 13 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (28,182 remaining, took 3.23s)
✅ Filled 645 values in 'power' using ['brand', 'model', 'year_bin'] (27,537 remaining, took 2.70s)
✅ Filled 13 values in 'power' using ['brand', 'model', 'registrationyear'] (27,524 remaining, took 5.20s)
→ Starting with 3,279 missing values in 'vehicletype'
✅ Filled 108 values in 'vehicletype' using ['brand', 'model', 'power', 'year_bin'] (3,171 remaining, took 10.96s)
✅ Filled 138 values in 'vehicletype' using ['brand', 'model', 'power', 'registrationyear'] (3,033 remaining, took 21.72s)
✅ Filled 7 values in 'vehicletype' using ['brand', 'model', 'power', 'gearbox'] (3,026 remaining, took 9.65s)
✅ Filled 108 values in 'vehicletype' using ['brand', 'model', 'year_bin'] (2,918 remaining, took 2.82s)
✅ Filled 24 values in 'vehicletype' using ['brand', 'model', 'registrationyear'] (2,894 remaining, took 5.59s)
✅ Filled 2 values in 'vehicletype' using ['brand', 'model', 'power'] (2,892 remaining, took 7.90s)
✅ Filled 55 values in 'vehicletype' using ['brand', 'model', 'gearbox'] (2,837 remaining, took 2.87s)
→ Starting with 5,165 missing values in 'model'
✅ Filled 41 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (5,124 remaining, took 11.03s)
✅ Filled 1 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (5,123 remaining, took 22.58s)
✅ Filled 1 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (5,122 remaining, took 9.63s)
✅ Filled 3 values in 'model' using ['brand', 'vehicletype', 'registrationyear'] (5,119 remaining, took 5.75s)
✅ Filled 3 values in 'model' using ['brand', 'vehicletype', 'power'] (5,116 remaining, took 7.90s)
→ Starting with 328 missing values in 'fueltype'
✅ Filled 4 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'year_bin'] (324 remaining, took 14.74s)
✅ Filled 1 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'registrationyear'] (323 remaining, took 27.65s)
🔁 Iteration 3 filled 1,244 total values

🌀 Iteration 4 starting...
→ Starting with 27,524 missing values in 'power'
✅ Filled 4 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (27,520 remaining, took 4.22s)
✅ Filled 94 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (27,426 remaining, took 3.51s)
→ Starting with 2,837 missing values in 'vehicletype'
✅ Filled 9 values in 'vehicletype' using ['brand', 'model', 'power', 'year_bin'] (2,828 remaining, took 11.19s)
✅ Filled 17 values in 'vehicletype' using ['brand', 'model', 'power', 'registrationyear'] (2,811 remaining, took 22.20s)
✅ Filled 333 values in 'vehicletype' using ['brand', 'model', 'year_bin'] (2,478 remaining, took 2.95s)
→ Starting with 5,116 missing values in 'model'
✅ Filled 2 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (5,114 remaining, took 11.32s)
✅ Filled 3 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (5,111 remaining, took 23.08s)
✅ Filled 3 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (5,108 remaining, took 10.25s)
→ Starting with 323 missing values in 'fueltype'
✅ Filled 1 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'registrationyear'] (322 remaining, took 9.03s)
🔁 Iteration 4 filled 466 total values

🌀 Iteration 5 starting...
→ Starting with 27,426 missing values in 'power'
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (27,425 remaining, took 4.53s)
✅ Filled 13 values in 'power' using ['brand', 'model', 'registrationyear'] (27,412 remaining, took 5.68s)
→ Starting with 2,478 missing values in 'vehicletype'
→ Starting with 5,108 missing values in 'model'
→ Starting with 322 missing values in 'fueltype'
🔁 Iteration 5 filled 14 total values

🌀 Iteration 6 starting...
→ Starting with 27,412 missing values in 'power'
→ Starting with 2,478 missing values in 'vehicletype'
→ Starting with 5,108 missing values in 'model'
→ Starting with 322 missing values in 'fueltype'
🔁 Iteration 6 filled 0 total values
🏁 No further changes detected, stopping.
# Manual corrections: rows of these brand/model pairs tagged 'bus' are relabelled as 'small'.
bus_to_small = [('citroen', 'c4'), ('renault', 'megane'), ('ford', 'fusion'), ('seat', 'leon')]
for brand, model in bus_to_small:
    mask = (df_app['brand'] == brand) & (df_app['model'] == model) & (df_app['vehicletype'] == 'bus')
    df_app.loc[mask, 'vehicletype'] = 'small'
del mask
print(df_app.memory_usage(deep=True).sum() / 1_000_000, "MB")
240.465945 MB
del df_app3g
df_app1 = fill_gearbox(df_app, threshold = 0.75)
Filled 2775 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (≥75% majority rule)
Filled 456 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype'] (≥75% majority rule)
Filled 52 missing gearbox values using ['brand', 'model', 'fueltype'] (≥75% majority rule)
Filled 561 missing gearbox values using ['brand', 'model'] (≥75% majority rule)
Filled 57 missing gearbox values using ['brand'] (≥75% majority rule)
✅ Gearbox filling complete: 3901 filled, 1624 still missing.
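The "≥75% majority rule" in the log above can be illustrated with a small, self-contained sketch. The notebook's own `fill_gearbox` is defined earlier and may work differently; `fill_by_majority` and the toy frame below are hypothetical names used only for illustration. The assumption is that a missing value is filled with a group's most common value only when that value covers at least `threshold` of the group's known entries.

```python
import numpy as np
import pandas as pd


def fill_by_majority(df, target, group_cols, threshold=0.75):
    """Fill NaNs in `target` with a group's dominant value.

    A value is used only if it accounts for at least `threshold` of the
    group's known (non-missing) entries. Hypothetical helper, not the
    notebook's `fill_gearbox`.
    """
    out = df.copy()
    known = out.dropna(subset=[target])

    # Count each target value per group, then convert counts to shares.
    counts = known.groupby(group_cols + [target]).size().rename("n").reset_index()
    counts["share"] = counts["n"] / counts.groupby(group_cols)["n"].transform("sum")

    # Keep at most one qualifying value per group (the largest share).
    winners = (
        counts[counts["share"] >= threshold]
        .sort_values("share", ascending=False)
        .drop_duplicates(subset=group_cols)
    )

    missing = out[target].isna()
    if missing.any():
        # A left merge keeps the row order of the rows being filled.
        fill = out.loc[missing, group_cols].merge(
            winners[group_cols + [target]], on=group_cols, how="left"
        )[target]
        out.loc[missing, target] = fill.to_numpy()
    return out


# Toy data: 'vw' is 100% manual among known rows, 'opel' is split 50/50.
demo = pd.DataFrame({
    "brand": ["vw"] * 5 + ["opel"] * 3,
    "gearbox": ["manual"] * 4 + [np.nan, "auto", "manual", np.nan],
})
filled = fill_by_majority(demo, "gearbox", ["brand"], threshold=0.75)
print(filled)
```

In this toy case the missing `vw` row becomes `manual`, while the `opel` row stays missing because neither value reaches the threshold. Lowering the threshold, as the later cells do, lets more groups qualify at the cost of less certain fills.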
# Coarse region feature: first character of the postal code.
df_app1['pc_bin'] = df_app1['postalcode'].astype(str).str[0]
# Try to fill power == 0 using several groupings in turn (75% threshold).
df_app1 = fill_zero_power(df_app1, group_cols=['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'], threshold=0.75)
display(df_app1[df_app1['power'] == 0])
df_app1 = fill_zero_power(df_app1, group_cols=['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'], threshold=0.75)
df_app1 = fill_zero_power(df_app1, group_cols=['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'], threshold=0.75)
df_app1 = fill_zero_power(df_app1, group_cols=['brand', 'model', 'vehicletype', 'fueltype', 'pc_bin'], threshold=0.75)
df_app1 = fill_zero_power(df_app1, group_cols=['brand', 'model', 'vehicletype', 'pc_bin'], threshold=0.75)
# Rows that still have zero power after all passes.
df_app1[df_app1['power'] == 0]
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin | pc_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 15/03/2016 20:59 | 245 | sedan | 1994.0 | manual | 0.0 | golf | 150000 | 2 | petrol | volkswagen | no | 2016-03-15 | 0 | 44145 | 17/03/2016 18:17 | N | 1990s | 4 |
| 40 | 17/03/2016 07:56 | 4700 | wagon | 2005.0 | manual | 0.0 | signum | 150000 | 0 | gasoline | opel | no | 2016-03-17 | 0 | 88433 | 04/04/2016 04:17 | N | 2000s | 8 |
| 52 | 01/04/2016 11:56 | 1200 | coupe | 2001.0 | manual | 0.0 | astra | 150000 | 0 | petrol | opel | NaN | 2016-01-04 | 0 | 47249 | 07/04/2016 08:46 | N | 2000s | 4 |
| 53 | 08/03/2016 01:36 | 800 | small | 1993.0 | manual | 0.0 | polo | 150000 | 3 | petrol | volkswagen | no | 2016-08-03 | 0 | 8258 | 05/04/2016 23:46 | N | 1990s | 8 |
| 68 | 23/03/2016 11:53 | 2400 | sedan | 2003.0 | manual | 0.0 | a4 | 150000 | 9 | gasoline | audi | NaN | 2016-03-23 | 0 | 40210 | 23/03/2016 11:53 | N | 2000s | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 340778 | 14/03/2016 12:37 | 2800 | wagon | 2013.0 | manual | 0.0 | passat | 150000 | 0 | gasoline | volkswagen | NaN | 2016-03-14 | 0 | 45892 | 19/03/2016 23:46 | N | 2010_plus | 4 |
| 340779 | 14/03/2016 22:37 | 5500 | wagon | 2013.0 | auto | 0.0 | passat | 150000 | 1 | gasoline | volkswagen | no | 2016-03-14 | 0 | 90441 | 15/03/2016 19:47 | N | 2010_plus | 9 |
| 340784 | 15/03/2016 11:45 | 850 | other | 2013.0 | manual | 0.0 | other | 5000 | 0 | petrol | audi | no | 2016-03-15 | 0 | 86647 | 16/03/2016 07:17 | N | 2010_plus | 8 |
| 340785 | 10/03/2016 19:42 | 1850 | convertible | 2013.0 | auto | 0.0 | megane | 150000 | 5 | petrol | renault | no | 2016-10-03 | 0 | 27432 | 06/04/2016 02:17 | N | 2010_plus | 2 |
| 340793 | 07/04/2016 08:36 | 1670 | convertible | 2013.0 | manual | 0.0 | megane | 90000 | 0 | petrol | renault | no | 2016-07-04 | 0 | 12167 | 07/04/2016 08:36 | N | 2010_plus | 1 |
25888 rows × 19 columns
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin | pc_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 15/03/2016 20:59 | 245 | sedan | 1994.0 | manual | 0.0 | golf | 150000 | 2 | petrol | volkswagen | no | 2016-03-15 | 0 | 44145 | 17/03/2016 18:17 | N | 1990s | 4 |
| 40 | 17/03/2016 07:56 | 4700 | wagon | 2005.0 | manual | 0.0 | signum | 150000 | 0 | gasoline | opel | no | 2016-03-17 | 0 | 88433 | 04/04/2016 04:17 | N | 2000s | 8 |
| 52 | 01/04/2016 11:56 | 1200 | coupe | 2001.0 | manual | 0.0 | astra | 150000 | 0 | petrol | opel | NaN | 2016-01-04 | 0 | 47249 | 07/04/2016 08:46 | N | 2000s | 4 |
| 53 | 08/03/2016 01:36 | 800 | small | 1993.0 | manual | 0.0 | polo | 150000 | 3 | petrol | volkswagen | no | 2016-08-03 | 0 | 8258 | 05/04/2016 23:46 | N | 1990s | 8 |
| 68 | 23/03/2016 11:53 | 2400 | sedan | 2003.0 | manual | 0.0 | a4 | 150000 | 9 | gasoline | audi | NaN | 2016-03-23 | 0 | 40210 | 23/03/2016 11:53 | N | 2000s | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 340772 | 15/03/2016 08:51 | 1300 | sedan | 2013.0 | NaN | 0.0 | 5er | 150000 | 0 | petrol | bmw | yes | 2016-03-15 | 0 | 66130 | 27/03/2016 19:46 | N | 2010_plus | 6 |
| 340773 | 08/03/2016 21:06 | 3400 | wagon | 2013.0 | manual | 0.0 | passat | 5000 | 10 | gasoline | volkswagen | no | 2016-08-03 | 0 | 35435 | 15/03/2016 11:45 | N | 2010_plus | 3 |
| 340774 | 14/03/2016 19:40 | 5999 | coupe | 2013.0 | manual | 0.0 | NaN | 150000 | 12 | gasoline | sonstige_autos | NaN | 2016-03-14 | 0 | 89081 | 19/03/2016 07:47 | N | 2010_plus | 8 |
| 340779 | 14/03/2016 22:37 | 5500 | wagon | 2013.0 | auto | 0.0 | passat | 150000 | 1 | gasoline | volkswagen | no | 2016-03-14 | 0 | 90441 | 15/03/2016 19:47 | N | 2010_plus | 9 |
| 340784 | 15/03/2016 11:45 | 850 | other | 2013.0 | manual | 0.0 | other | 5000 | 0 | petrol | audi | no | 2016-03-15 | 0 | 86647 | 16/03/2016 07:17 | N | 2010_plus | 8 |
22256 rows × 19 columns
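One detail of `pc_bin` is visible in the tables above: the row with `postalcode` 8258 gets `pc_bin` 8. If these are five-digit postal codes whose leading zero was dropped when the column was read as an integer (an assumption, not something the notebook states), the first character of the raw string is not the region digit for such rows. A minimal sketch of the difference, using a toy series rather than the real data:

```python
import pandas as pd

# Toy postal codes; 8258 and 6493 stand for codes assumed to be 08258 and 06493.
postalcode = pd.Series([70435, 8258, 6493, 44145])

# First character of the raw string, as in the cell above.
naive_bin = postalcode.astype(str).str[0]

# Left-pad to five digits first, so four-digit values fall into region '0'.
padded_bin = postalcode.astype(str).str.zfill(5).str[0]

print(naive_bin.tolist())   # ['7', '8', '6', '4']
print(padded_bin.tolist())  # ['7', '0', '0', '4']
```

Whether the padded variant is appropriate depends on how the postal codes were originally encoded. Switching to it would also change the group memberships, and therefore the fill counts, shown in this notebook.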
gc.collect()
0
df_app1 = fill_gearbox(df_app1, threshold = 0.6)
Filled 433 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (≥60% majority rule)
Filled 173 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype'] (≥60% majority rule)
Filled 46 missing gearbox values using ['brand', 'model', 'fueltype'] (≥60% majority rule)
Filled 43 missing gearbox values using ['brand', 'model'] (≥60% majority rule)
Filled 165 missing gearbox values using ['brand'] (≥60% majority rule)
✅ Gearbox filling complete: 860 filled, 764 still missing.
gc.collect()
0
# Repeat the zero-power fill over the same groupings with the threshold relaxed to 60%.
df_app1 = fill_zero_power(df_app1, group_cols=['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'], threshold=0.6)
display(df_app1[df_app1['power'] == 0])
df_app1 = fill_zero_power(df_app1, group_cols=['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'], threshold=0.6)
df_app1 = fill_zero_power(df_app1, group_cols=['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'], threshold=0.6)
df_app1 = fill_zero_power(df_app1, group_cols=['brand', 'model', 'vehicletype', 'fueltype', 'pc_bin'], threshold=0.6)
df_app1 = fill_zero_power(df_app1, group_cols=['brand', 'model', 'vehicletype', 'pc_bin'], threshold=0.6)
# Rows that still have zero power after all passes.
df_app1[df_app1['power'] == 0]
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin | pc_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 15/03/2016 20:59 | 245 | sedan | 1994.0 | manual | 0.0 | golf | 150000 | 2 | petrol | volkswagen | no | 2016-03-15 | 0 | 44145 | 17/03/2016 18:17 | N | 1990s | 4 |
| 40 | 17/03/2016 07:56 | 4700 | wagon | 2005.0 | manual | 0.0 | signum | 150000 | 0 | gasoline | opel | no | 2016-03-17 | 0 | 88433 | 04/04/2016 04:17 | N | 2000s | 8 |
| 52 | 01/04/2016 11:56 | 1200 | coupe | 2001.0 | manual | 0.0 | astra | 150000 | 0 | petrol | opel | NaN | 2016-01-04 | 0 | 47249 | 07/04/2016 08:46 | N | 2000s | 4 |
| 53 | 08/03/2016 01:36 | 800 | small | 1993.0 | manual | 0.0 | polo | 150000 | 3 | petrol | volkswagen | no | 2016-08-03 | 0 | 8258 | 05/04/2016 23:46 | N | 1990s | 8 |
| 68 | 23/03/2016 11:53 | 2400 | sedan | 2003.0 | manual | 0.0 | a4 | 150000 | 9 | gasoline | audi | NaN | 2016-03-23 | 0 | 40210 | 23/03/2016 11:53 | N | 2000s | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 340772 | 15/03/2016 08:51 | 1300 | sedan | 2013.0 | manual | 0.0 | 5er | 150000 | 0 | petrol | bmw | yes | 2016-03-15 | 0 | 66130 | 27/03/2016 19:46 | N | 2010_plus | 6 |
| 340773 | 08/03/2016 21:06 | 3400 | wagon | 2013.0 | manual | 0.0 | passat | 5000 | 10 | gasoline | volkswagen | no | 2016-08-03 | 0 | 35435 | 15/03/2016 11:45 | N | 2010_plus | 3 |
| 340774 | 14/03/2016 19:40 | 5999 | coupe | 2013.0 | manual | 0.0 | NaN | 150000 | 12 | gasoline | sonstige_autos | NaN | 2016-03-14 | 0 | 89081 | 19/03/2016 07:47 | N | 2010_plus | 8 |
| 340779 | 14/03/2016 22:37 | 5500 | wagon | 2013.0 | auto | 0.0 | passat | 150000 | 1 | gasoline | volkswagen | no | 2016-03-14 | 0 | 90441 | 15/03/2016 19:47 | N | 2010_plus | 9 |
| 340784 | 15/03/2016 11:45 | 850 | other | 2013.0 | manual | 0.0 | other | 5000 | 0 | petrol | audi | no | 2016-03-15 | 0 | 86647 | 16/03/2016 07:17 | N | 2010_plus | 8 |
20858 rows × 19 columns
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin | pc_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 15/03/2016 20:59 | 245 | sedan | 1994.0 | manual | 0.0 | golf | 150000 | 2 | petrol | volkswagen | no | 2016-03-15 | 0 | 44145 | 17/03/2016 18:17 | N | 1990s | 4 |
| 40 | 17/03/2016 07:56 | 4700 | wagon | 2005.0 | manual | 0.0 | signum | 150000 | 0 | gasoline | opel | no | 2016-03-17 | 0 | 88433 | 04/04/2016 04:17 | N | 2000s | 8 |
| 52 | 01/04/2016 11:56 | 1200 | coupe | 2001.0 | manual | 0.0 | astra | 150000 | 0 | petrol | opel | NaN | 2016-01-04 | 0 | 47249 | 07/04/2016 08:46 | N | 2000s | 4 |
| 68 | 23/03/2016 11:53 | 2400 | sedan | 2003.0 | manual | 0.0 | a4 | 150000 | 9 | gasoline | audi | NaN | 2016-03-23 | 0 | 40210 | 23/03/2016 11:53 | N | 2000s | 4 |
| 85 | 03/04/2016 03:57 | 350 | small | 1998.0 | manual | 0.0 | corsa | 150000 | 2 | petrol | opel | NaN | 2016-03-04 | 0 | 82110 | 03/04/2016 08:53 | N | 1990s | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 340770 | 09/03/2016 21:00 | 1150 | small | 2013.0 | auto | 0.0 | fortwo | 150000 | 11 | petrol | smart | no | 2016-09-03 | 0 | 47443 | 10/03/2016 07:46 | N | 2010_plus | 4 |
| 340773 | 08/03/2016 21:06 | 3400 | wagon | 2013.0 | manual | 0.0 | passat | 5000 | 10 | gasoline | volkswagen | no | 2016-08-03 | 0 | 35435 | 15/03/2016 11:45 | N | 2010_plus | 3 |
| 340774 | 14/03/2016 19:40 | 5999 | coupe | 2013.0 | manual | 0.0 | NaN | 150000 | 12 | gasoline | sonstige_autos | NaN | 2016-03-14 | 0 | 89081 | 19/03/2016 07:47 | N | 2010_plus | 8 |
| 340779 | 14/03/2016 22:37 | 5500 | wagon | 2013.0 | auto | 0.0 | passat | 150000 | 1 | gasoline | volkswagen | no | 2016-03-14 | 0 | 90441 | 15/03/2016 19:47 | N | 2010_plus | 9 |
| 340784 | 15/03/2016 11:45 | 850 | other | 2013.0 | manual | 0.0 | other | 5000 | 0 | petrol | audi | no | 2016-03-15 | 0 | 86647 | 16/03/2016 07:17 | N | 2010_plus | 8 |
17893 rows × 19 columns
gc.collect()
0
df_app2 = fill_all_missing_values(df_app1, threshold = 0.6)
🌀 Iteration 1 starting...
→ Starting with 17,893 missing values in 'power'
✅ Filled 161 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (17,732 remaining, took 4.59s)
✅ Filled 438 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (17,294 remaining, took 10.90s)
✅ Filled 110 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (17,184 remaining, took 4.13s)
✅ Filled 117 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (17,067 remaining, took 3.60s)
✅ Filled 73 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (16,994 remaining, took 8.12s)
✅ Filled 11 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (16,983 remaining, took 3.30s)
✅ Filled 181 values in 'power' using ['brand', 'model', 'vehicletype'] (16,802 remaining, took 2.93s)
✅ Filled 2 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype'] (16,800 remaining, took 3.56s)
✅ Filled 57 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (16,743 remaining, took 3.52s)
✅ Filled 199 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (16,544 remaining, took 7.21s)
✅ Filled 61 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (16,483 remaining, took 3.12s)
✅ Filled 519 values in 'power' using ['brand', 'model', 'year_bin'] (15,964 remaining, took 2.84s)
✅ Filled 10 values in 'power' using ['brand', 'model', 'registrationyear'] (15,954 remaining, took 5.27s)
✅ Filled 208 values in 'power' using ['brand', 'model'] (15,746 remaining, took 2.42s)
✅ Filled 27 values in 'power' using ['brand', 'vehicletype'] (15,719 remaining, took 2.46s)
✅ Filled 2 values in 'power' using ['brand', 'year_bin'] (15,717 remaining, took 2.34s)
✅ Filled 3 values in 'power' using ['brand', 'registrationyear'] (15,714 remaining, took 3.54s)
→ Starting with 2,478 missing values in 'vehicletype'
✅ Filled 387 values in 'vehicletype' using ['brand', 'model', 'power', 'year_bin'] (2,091 remaining, took 11.01s)
✅ Filled 471 values in 'vehicletype' using ['brand', 'model', 'power', 'registrationyear'] (1,620 remaining, took 21.73s)
✅ Filled 299 values in 'vehicletype' using ['brand', 'model', 'power', 'gearbox'] (1,321 remaining, took 9.49s)
✅ Filled 265 values in 'vehicletype' using ['brand', 'model', 'year_bin'] (1,056 remaining, took 2.88s)
✅ Filled 127 values in 'vehicletype' using ['brand', 'model', 'registrationyear'] (929 remaining, took 5.74s)
✅ Filled 35 values in 'vehicletype' using ['brand', 'model', 'power'] (894 remaining, took 8.09s)
✅ Filled 94 values in 'vehicletype' using ['brand', 'model', 'gearbox'] (800 remaining, took 2.74s)
✅ Filled 29 values in 'vehicletype' using ['brand', 'model'] (771 remaining, took 2.52s)
✅ Filled 13 values in 'vehicletype' using ['brand', 'year_bin'] (758 remaining, took 2.49s)
✅ Filled 50 values in 'vehicletype' using ['brand', 'registrationyear'] (708 remaining, took 3.71s)
✅ Filled 22 values in 'vehicletype' using ['brand', 'power'] (686 remaining, took 5.05s)
→ Starting with 5,108 missing values in 'model'
✅ Filled 884 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (4,224 remaining, took 11.15s)
✅ Filled 584 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (3,640 remaining, took 22.49s)
✅ Filled 211 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (3,429 remaining, took 9.46s)
✅ Filled 71 values in 'model' using ['brand', 'vehicletype', 'year_bin'] (3,358 remaining, took 2.89s)
✅ Filled 88 values in 'model' using ['brand', 'vehicletype', 'registrationyear'] (3,270 remaining, took 5.82s)
✅ Filled 13 values in 'model' using ['brand', 'vehicletype', 'power'] (3,257 remaining, took 7.95s)
✅ Filled 70 values in 'model' using ['brand', 'vehicletype', 'gearbox'] (3,187 remaining, took 2.74s)
✅ Filled 1 values in 'model' using ['brand', 'vehicletype'] (3,186 remaining, took 2.60s)
✅ Filled 21 values in 'model' using ['brand', 'power'] (3,165 remaining, took 4.93s)
✅ Filled 1 values in 'model' using ['brand', 'registrationyear'] (3,164 remaining, took 3.68s)
✅ Filled 13 values in 'model' using ['brand', 'gearbox'] (3,151 remaining, took 2.42s)
→ Starting with 322 missing values in 'fueltype'
✅ Filled 234 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'year_bin'] (88 remaining, took 14.43s)
✅ Filled 23 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'registrationyear'] (65 remaining, took 27.04s)
✅ Filled 12 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'power', 'gearbox'] (53 remaining, took 13.08s)
✅ Filled 29 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'year_bin'] (24 remaining, took 3.77s)
✅ Filled 13 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'registrationyear'] (11 remaining, took 8.47s)
✅ Filled 4 values in 'fueltype' using ['brand', 'model', 'vehicletype', 'gearbox'] (7 remaining, took 3.44s)
✅ Filled 2 values in 'fueltype' using ['brand', 'model', 'power', 'registrationyear'] (5 remaining, took 21.49s)
✅ Filled 1 values in 'fueltype' using ['brand', 'year_bin'] (4 remaining, took 2.49s)
✅ Filled 2 values in 'fueltype' using ['brand', 'registrationyear'] (2 remaining, took 3.75s)
🔁 Iteration 1 filled 6,248 total values
🌀 Iteration 2 starting...
→ Starting with 15,714 missing values in 'power'
✅ Filled 53 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (15,661 remaining, took 4.47s)
✅ Filled 37 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (15,624 remaining, took 10.68s)
✅ Filled 13 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (15,611 remaining, took 4.09s)
✅ Filled 8 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (15,603 remaining, took 3.50s)
✅ Filled 3 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (15,600 remaining, took 7.94s)
✅ Filled 171 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (15,429 remaining, took 3.26s)
✅ Filled 331 values in 'power' using ['brand', 'model', 'vehicletype'] (15,098 remaining, took 2.89s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype'] (15,097 remaining, took 3.51s)
✅ Filled 19 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (15,078 remaining, took 3.28s)
✅ Filled 2 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (15,076 remaining, took 7.16s)
✅ Filled 67 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (15,009 remaining, took 3.17s)
✅ Filled 1,095 values in 'power' using ['brand', 'model', 'year_bin'] (13,914 remaining, took 2.80s)
✅ Filled 28 values in 'power' using ['brand', 'model', 'registrationyear'] (13,886 remaining, took 5.31s)
✅ Filled 413 values in 'power' using ['brand', 'model'] (13,473 remaining, took 2.49s)
→ Starting with 686 missing values in 'vehicletype'
✅ Filled 22 values in 'vehicletype' using ['brand', 'model', 'power', 'year_bin'] (664 remaining, took 11.00s)
✅ Filled 12 values in 'vehicletype' using ['brand', 'model', 'registrationyear'] (652 remaining, took 5.61s)
✅ Filled 197 values in 'vehicletype' using ['brand', 'model', 'gearbox'] (455 remaining, took 2.89s)
✅ Filled 2 values in 'vehicletype' using ['brand', 'model'] (453 remaining, took 2.64s)
→ Starting with 3,151 missing values in 'model'
✅ Filled 20 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (3,131 remaining, took 11.05s)
✅ Filled 21 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (3,110 remaining, took 22.27s)
✅ Filled 64 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (3,046 remaining, took 9.52s)
✅ Filled 3 values in 'model' using ['brand', 'vehicletype', 'power'] (3,043 remaining, took 7.88s)
→ Starting with 2 missing values in 'fueltype'
🔁 Iteration 2 filled 2,582 total values
🌀 Iteration 3 starting...
→ Starting with 13,473 missing values in 'power'
✅ Filled 3 values in 'power' using ['brand', 'model', 'vehicletype'] (13,470 remaining, took 2.88s)
✅ Filled 3 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (13,467 remaining, took 3.38s)
✅ Filled 64 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (13,403 remaining, took 3.07s)
✅ Filled 597 values in 'power' using ['brand', 'model', 'year_bin'] (12,806 remaining, took 2.77s)
→ Starting with 453 missing values in 'vehicletype'
✅ Filled 5 values in 'vehicletype' using ['brand', 'model', 'power', 'year_bin'] (448 remaining, took 10.93s)
✅ Filled 6 values in 'vehicletype' using ['brand', 'model', 'power', 'registrationyear'] (442 remaining, took 21.49s)
✅ Filled 8 values in 'vehicletype' using ['brand', 'model', 'power', 'gearbox'] (434 remaining, took 9.45s)
→ Starting with 3,043 missing values in 'model'
✅ Filled 5 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (3,038 remaining, took 11.10s)
✅ Filled 5 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (3,033 remaining, took 9.32s)
✅ Filled 2 values in 'model' using ['brand', 'power'] (3,031 remaining, took 4.92s)
→ Starting with 2 missing values in 'fueltype'
🔁 Iteration 3 filled 698 total values
🌀 Iteration 4 starting...
→ Starting with 12,806 missing values in 'power'
→ Starting with 434 missing values in 'vehicletype'
✅ Filled 42 values in 'vehicletype' using ['brand', 'model', 'gearbox'] (392 remaining, took 2.85s)
✅ Filled 5 values in 'vehicletype' using ['brand', 'model'] (387 remaining, took 2.55s)
→ Starting with 3,031 missing values in 'model'
✅ Filled 2 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (3,029 remaining, took 9.28s)
→ Starting with 2 missing values in 'fueltype'
🔁 Iteration 4 filled 49 total values
🌀 Iteration 5 starting...
→ Starting with 12,806 missing values in 'power'
→ Starting with 387 missing values in 'vehicletype'
→ Starting with 3,029 missing values in 'model'
→ Starting with 2 missing values in 'fueltype'
🔁 Iteration 5 filled 0 total values
🏁 No further changes detected, stopping.
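The log above shows a fill routine that is rerun until an iteration fills nothing. The control flow can be sketched as follows. This is not the notebook's `fill_all_missing_values`: `iterate_until_stable` and `fill_from_brand_mode` are hypothetical names, and the sketch treats only NaN as missing, whereas the notebook also appears to count zero `power` values.

```python
import pandas as pd


def iterate_until_stable(df, fill_step, columns, max_iter=10):
    """Apply `fill_step` to each column repeatedly until a pass fills nothing.

    `fill_step(df, column)` is a hypothetical callable returning a new
    frame in which some NaNs of `column` may have been filled.
    """
    current = df
    for iteration in range(1, max_iter + 1):
        before = int(current[columns].isna().sum().sum())
        for column in columns:
            current = fill_step(current, column)
        filled = before - int(current[columns].isna().sum().sum())
        print(f"Iteration {iteration} filled {filled} total values")
        if filled == 0:
            print("No further changes detected, stopping.")
            break
    return current


def fill_from_brand_mode(df, column):
    """Toy fill step: use the most common known value within each brand."""
    out = df.copy()
    mode_by_brand = (
        out.dropna(subset=[column])
        .groupby("brand")[column]
        .agg(lambda s: s.mode().iloc[0])
    )
    # Brands without any known value map to NaN and therefore stay missing.
    out[column] = out[column].fillna(out["brand"].map(mode_by_brand))
    return out


demo = pd.DataFrame({
    "brand": ["vw", "vw", "vw", "opel"],
    "fueltype": ["petrol", None, "petrol", None],
})
result = iterate_until_stable(demo, fill_from_brand_mode, ["fueltype"])
print(result)
```

Repeating the loop pays off when a value filled in one column creates a usable group for another column on the next pass, which would explain the shrinking per-iteration totals in the log. The `max_iter` cap is only a safeguard added for this sketch.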
del df_app1
gc.collect()
0
# Manual vehicletype corrections for specific brand/model combinations.
# Golf: rows tagged 'bus' and rows with a missing type both end up as 'other'.
golf = (df_app2['brand'] == 'volkswagen') & (df_app2['model'] == 'golf')
mask = golf & ((df_app2['vehicletype'] == 'bus') | df_app2['vehicletype'].isna())
df_app2.loc[mask, 'vehicletype'] = 'other'
# (brand, model, current type, corrected type)
corrections = [
    ('mercedes_benz', 'a_klasse', 'bus', 'small'),
    ('smart', 'fortwo', 'bus', 'small'),
    ('lada', 'niva', 'bus', 'suv'),
    ('volkswagen', 'transporter', 'wagon', 'bus'),
]
for brand, model, wrong, right in corrections:
    mask = (df_app2['brand'] == brand) & (df_app2['model'] == model) & (df_app2['vehicletype'] == wrong)
    df_app2.loc[mask, 'vehicletype'] = right
del mask, golf
gc.collect()
0
print(df_app2.memory_usage(deep=True).sum() / 1_000_000, "MB")
260.514323 MB
df_app2 = fill_gearbox(df_app2, threshold = 0.6)
Filled 58 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (≥60% majority rule)
Filled 26 missing gearbox values using ['brand', 'model', 'fueltype', 'vehicletype'] (≥60% majority rule)
✅ Gearbox filling complete: 84 filled, 680 still missing.
df_ap4 = df_app2.drop_duplicates()
df_ap4[df_ap4['power'] == 0]
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin | pc_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 15/03/2016 20:59 | 245 | sedan | 1994.0 | manual | 0.0 | golf | 150000 | 2 | petrol | volkswagen | no | 2016-03-15 | 0 | 44145 | 17/03/2016 18:17 | N | 1990s | 4 |
| 40 | 17/03/2016 07:56 | 4700 | wagon | 2005.0 | manual | 0.0 | signum | 150000 | 0 | gasoline | opel | no | 2016-03-17 | 0 | 88433 | 04/04/2016 04:17 | N | 2000s | 8 |
| 52 | 01/04/2016 11:56 | 1200 | coupe | 2001.0 | manual | 0.0 | astra | 150000 | 0 | petrol | opel | NaN | 2016-01-04 | 0 | 47249 | 07/04/2016 08:46 | N | 2000s | 4 |
| 68 | 23/03/2016 11:53 | 2400 | sedan | 2003.0 | manual | 0.0 | a4 | 150000 | 9 | gasoline | audi | NaN | 2016-03-23 | 0 | 40210 | 23/03/2016 11:53 | N | 2000s | 4 |
| 85 | 03/04/2016 03:57 | 350 | small | 1998.0 | manual | 0.0 | corsa | 150000 | 2 | petrol | opel | NaN | 2016-03-04 | 0 | 82110 | 03/04/2016 08:53 | N | 1990s | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 340761 | 10/03/2016 22:49 | 1200 | small | 2013.0 | manual | 0.0 | i_reihe | 50000 | 12 | petrol | hyundai | NaN | 2016-10-03 | 0 | 6493 | 14/03/2016 09:16 | N | 2010_plus | 6 |
| 340769 | 11/03/2016 03:03 | 1500 | coupe | 2013.0 | NaN | 0.0 | NaN | 5000 | 0 | petrol | sonstige_autos | NaN | 2016-11-03 | 0 | 40476 | 06/04/2016 04:44 | N | 2010_plus | 4 |
| 340770 | 09/03/2016 21:00 | 1150 | small | 2013.0 | auto | 0.0 | fortwo | 150000 | 11 | petrol | smart | no | 2016-09-03 | 0 | 47443 | 10/03/2016 07:46 | N | 2010_plus | 4 |
| 340774 | 14/03/2016 19:40 | 5999 | coupe | 2013.0 | manual | 0.0 | NaN | 150000 | 12 | gasoline | sonstige_autos | NaN | 2016-03-14 | 0 | 89081 | 19/03/2016 07:47 | N | 2010_plus | 8 |
| 340784 | 15/03/2016 11:45 | 850 | other | 2013.0 | manual | 0.0 | other | 5000 | 0 | petrol | audi | no | 2016-03-15 | 0 | 86647 | 16/03/2016 07:17 | N | 2010_plus | 8 |
12806 rows × 19 columns
# Zero-power fill on the de-duplicated frame, same groupings, 60% threshold.
df_app4 = fill_zero_power(df_ap4, group_cols=['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'], threshold=0.6)
display(df_app4[df_app4['power'] == 0])
df_app4 = fill_zero_power(df_app4, group_cols=['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'], threshold=0.6)
df_app4 = fill_zero_power(df_app4, group_cols=['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'], threshold=0.6)
df_app4 = fill_zero_power(df_app4, group_cols=['brand', 'model', 'vehicletype', 'fueltype', 'pc_bin'], threshold=0.6)
df_app4 = fill_zero_power(df_app4, group_cols=['brand', 'model', 'vehicletype', 'pc_bin'], threshold=0.6)
# Rows that still have zero power after all passes.
display(df_app4[df_app4['power'] == 0])
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin | pc_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 15/03/2016 20:59 | 245 | sedan | 1994.0 | manual | 0.0 | golf | 150000 | 2 | petrol | volkswagen | no | 2016-03-15 | 0 | 44145 | 17/03/2016 18:17 | N | 1990s | 4 |
| 40 | 17/03/2016 07:56 | 4700 | wagon | 2005.0 | manual | 0.0 | signum | 150000 | 0 | gasoline | opel | no | 2016-03-17 | 0 | 88433 | 04/04/2016 04:17 | N | 2000s | 8 |
| 52 | 01/04/2016 11:56 | 1200 | coupe | 2001.0 | manual | 0.0 | astra | 150000 | 0 | petrol | opel | NaN | 2016-01-04 | 0 | 47249 | 07/04/2016 08:46 | N | 2000s | 4 |
| 68 | 23/03/2016 11:53 | 2400 | sedan | 2003.0 | manual | 0.0 | a4 | 150000 | 9 | gasoline | audi | NaN | 2016-03-23 | 0 | 40210 | 23/03/2016 11:53 | N | 2000s | 4 |
| 85 | 03/04/2016 03:57 | 350 | small | 1998.0 | manual | 0.0 | corsa | 150000 | 2 | petrol | opel | NaN | 2016-03-04 | 0 | 82110 | 03/04/2016 08:53 | N | 1990s | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 340740 | 10/03/2016 22:49 | 1200 | small | 2013.0 | manual | 0.0 | i_reihe | 50000 | 12 | petrol | hyundai | NaN | 2016-10-03 | 0 | 6493 | 14/03/2016 09:16 | N | 2010_plus | 6 |
| 340748 | 11/03/2016 03:03 | 1500 | coupe | 2013.0 | NaN | 0.0 | NaN | 5000 | 0 | petrol | sonstige_autos | NaN | 2016-11-03 | 0 | 40476 | 06/04/2016 04:44 | N | 2010_plus | 4 |
| 340749 | 09/03/2016 21:00 | 1150 | small | 2013.0 | auto | 0.0 | fortwo | 150000 | 11 | petrol | smart | no | 2016-09-03 | 0 | 47443 | 10/03/2016 07:46 | N | 2010_plus | 4 |
| 340753 | 14/03/2016 19:40 | 5999 | coupe | 2013.0 | manual | 0.0 | NaN | 150000 | 12 | gasoline | sonstige_autos | NaN | 2016-03-14 | 0 | 89081 | 19/03/2016 07:47 | N | 2010_plus | 8 |
| 340763 | 15/03/2016 11:45 | 850 | other | 2013.0 | manual | 0.0 | other | 5000 | 0 | petrol | audi | no | 2016-03-15 | 0 | 86647 | 16/03/2016 07:17 | N | 2010_plus | 8 |
12735 rows × 19 columns
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin | pc_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 15/03/2016 20:59 | 245 | sedan | 1994.0 | manual | 0.0 | golf | 150000 | 2 | petrol | volkswagen | no | 2016-03-15 | 0 | 44145 | 17/03/2016 18:17 | N | 1990s | 4 |
| 40 | 17/03/2016 07:56 | 4700 | wagon | 2005.0 | manual | 0.0 | signum | 150000 | 0 | gasoline | opel | no | 2016-03-17 | 0 | 88433 | 04/04/2016 04:17 | N | 2000s | 8 |
| 52 | 01/04/2016 11:56 | 1200 | coupe | 2001.0 | manual | 0.0 | astra | 150000 | 0 | petrol | opel | NaN | 2016-01-04 | 0 | 47249 | 07/04/2016 08:46 | N | 2000s | 4 |
| 68 | 23/03/2016 11:53 | 2400 | sedan | 2003.0 | manual | 0.0 | a4 | 150000 | 9 | gasoline | audi | NaN | 2016-03-23 | 0 | 40210 | 23/03/2016 11:53 | N | 2000s | 4 |
| 85 | 03/04/2016 03:57 | 350 | small | 1998.0 | manual | 0.0 | corsa | 150000 | 2 | petrol | opel | NaN | 2016-03-04 | 0 | 82110 | 03/04/2016 08:53 | N | 1990s | 8 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 340740 | 10/03/2016 22:49 | 1200 | small | 2013.0 | manual | 0.0 | i_reihe | 50000 | 12 | petrol | hyundai | NaN | 2016-10-03 | 0 | 6493 | 14/03/2016 09:16 | N | 2010_plus | 6 |
| 340748 | 11/03/2016 03:03 | 1500 | coupe | 2013.0 | NaN | 0.0 | NaN | 5000 | 0 | petrol | sonstige_autos | NaN | 2016-11-03 | 0 | 40476 | 06/04/2016 04:44 | N | 2010_plus | 4 |
| 340749 | 09/03/2016 21:00 | 1150 | small | 2013.0 | auto | 0.0 | fortwo | 150000 | 11 | petrol | smart | no | 2016-09-03 | 0 | 47443 | 10/03/2016 07:46 | N | 2010_plus | 4 |
| 340753 | 14/03/2016 19:40 | 5999 | coupe | 2013.0 | manual | 0.0 | NaN | 150000 | 12 | gasoline | sonstige_autos | NaN | 2016-03-14 | 0 | 89081 | 19/03/2016 07:47 | N | 2010_plus | 8 |
| 340763 | 15/03/2016 11:45 | 850 | other | 2013.0 | manual | 0.0 | other | 5000 | 0 | petrol | audi | no | 2016-03-15 | 0 | 86647 | 16/03/2016 07:17 | N | 2010_plus | 8 |
12516 rows × 19 columns
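The cells above call `fill_zero_power` once per grouping and display the full frame of remaining zero-power rows. A more compact way to track progress is to loop over the groupings and print only the remaining count. The sketch below assumes nothing about the real `fill_zero_power` beyond its call signature as used in this notebook. `run_zero_power_passes` and `demo_fill` are hypothetical, and `demo_fill` is a deliberately simple stand-in (group median of non-zero values) that ignores `threshold`.

```python
import pandas as pd


def run_zero_power_passes(df, fill_fn, group_col_sets, threshold):
    """Apply `fill_fn` once per grouping and report remaining zero-power rows."""
    current = df
    for group_cols in group_col_sets:
        current = fill_fn(current, group_cols=group_cols, threshold=threshold)
        remaining = int((current["power"] == 0).sum())
        print(f"{group_cols}: {remaining} zero-power rows remaining")
    return current


def demo_fill(df, group_cols, threshold):
    """Stand-in only: replace zeros with the group's median non-zero power.

    `threshold` is accepted for signature compatibility but not used here.
    """
    out = df.copy()
    nonzero = out["power"].where(out["power"] > 0)
    median = nonzero.groupby([out[c] for c in group_cols]).transform("median")
    zero = out["power"] == 0
    # Groups without any non-zero value keep their zeros.
    out.loc[zero, "power"] = median[zero].fillna(0)
    return out


demo = pd.DataFrame({
    "brand": ["vw", "vw", "vw", "opel"],
    "model": ["golf", "golf", "polo", "astra"],
    "power": [75.0, 0.0, 0.0, 0.0],
})
group_col_sets = [["brand", "model"], ["brand"]]
result = run_zero_power_passes(demo, demo_fill, group_col_sets, threshold=0.6)
print(result)
```

In the toy data the first grouping fills the second `golf` row, the coarser grouping then fills the `polo` row, and the `opel` row stays at zero because its group has no non-zero reference value.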
df_app5 = fill_missing_models_majority_x(df_app4, threshold = 0.6)
✅ Filled 5 missing models (threshold=60%)
df_app6 = fill_missing_models_majority(df_app5, threshold = 0.6)
df_app7 = df_app6[df_app6['vehicletype'].notna()]
df_app8 = df_app7[df_app7['gearbox'].notna()]
# Final zero-power fill over the same groupings with the threshold relaxed to 55%.
df_app9 = fill_zero_power(df_app8, group_cols=['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'], threshold=0.55)
display(df_app9[df_app9['power'] == 0])
df_app9 = fill_zero_power(df_app9, group_cols=['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'], threshold=0.55)
df_app9 = fill_zero_power(df_app9, group_cols=['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'], threshold=0.55)
df_app9 = fill_zero_power(df_app9, group_cols=['brand', 'model', 'vehicletype', 'fueltype', 'pc_bin'], threshold=0.55)
df_app9 = fill_zero_power(df_app9, group_cols=['brand', 'model', 'vehicletype', 'pc_bin'], threshold=0.55)
# Rows that still have zero power after all passes.
df_app9[df_app9['power'] == 0]
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin | pc_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 15/03/2016 20:59 | 245 | sedan | 1994.0 | manual | 0.0 | golf | 150000 | 2 | petrol | volkswagen | no | 2016-03-15 | 0 | 44145 | 17/03/2016 18:17 | N | 1990s | 4 |
| 52 | 01/04/2016 11:56 | 1200 | coupe | 2001.0 | manual | 0.0 | astra | 150000 | 0 | petrol | opel | NaN | 2016-01-04 | 0 | 47249 | 07/04/2016 08:46 | N | 2000s | 4 |
| 68 | 23/03/2016 11:53 | 2400 | sedan | 2003.0 | manual | 0.0 | a4 | 150000 | 9 | gasoline | audi | NaN | 2016-03-23 | 0 | 40210 | 23/03/2016 11:53 | N | 2000s | 4 |
| 85 | 03/04/2016 03:57 | 350 | small | 1998.0 | manual | 0.0 | corsa | 150000 | 2 | petrol | opel | NaN | 2016-03-04 | 0 | 82110 | 03/04/2016 08:53 | N | 1990s | 8 |
| 122 | 01/04/2016 16:06 | 800 | sedan | 1993.0 | manual | 0.0 | golf | 10000 | 9 | petrol | volkswagen | yes | 2016-01-04 | 0 | 65929 | 07/04/2016 11:17 | N | 1990s | 6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339969 | 01/04/2016 12:38 | 1299 | coupe | 2013.0 | auto | 0.0 | NaN | 5000 | 0 | petrol | sonstige_autos | no | 2016-01-04 | 0 | 48703 | 01/04/2016 12:38 | N | 2010_plus | 4 |
| 339973 | 05/04/2016 02:36 | 1500 | coupe | 2013.0 | manual | 0.0 | NaN | 5000 | 11 | petrol | sonstige_autos | no | 2016-05-04 | 0 | 27474 | 05/04/2016 08:46 | N | 2010_plus | 2 |
| 339974 | 09/03/2016 12:58 | 700 | coupe | 2013.0 | manual | 0.0 | NaN | 100000 | 1 | petrol | sonstige_autos | no | 2016-09-03 | 0 | 51570 | 05/04/2016 01:16 | N | 2010_plus | 5 |
| 339976 | 20/03/2016 23:49 | 3000 | coupe | 2013.0 | manual | 0.0 | NaN | 150000 | 0 | gasoline | sonstige_autos | NaN | 2016-03-20 | 0 | 85072 | 23/03/2016 11:17 | N | 2010_plus | 8 |
| 339978 | 14/03/2016 19:40 | 5999 | coupe | 2013.0 | manual | 0.0 | NaN | 150000 | 12 | gasoline | sonstige_autos | NaN | 2016-03-14 | 0 | 89081 | 19/03/2016 07:47 | N | 2010_plus | 8 |
11827 rows × 19 columns
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin | pc_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 24 | 15/03/2016 20:59 | 245 | sedan | 1994.0 | manual | 0.0 | golf | 150000 | 2 | petrol | volkswagen | no | 2016-03-15 | 0 | 44145 | 17/03/2016 18:17 | N | 1990s | 4 |
| 52 | 01/04/2016 11:56 | 1200 | coupe | 2001.0 | manual | 0.0 | astra | 150000 | 0 | petrol | opel | NaN | 2016-01-04 | 0 | 47249 | 07/04/2016 08:46 | N | 2000s | 4 |
| 68 | 23/03/2016 11:53 | 2400 | sedan | 2003.0 | manual | 0.0 | a4 | 150000 | 9 | gasoline | audi | NaN | 2016-03-23 | 0 | 40210 | 23/03/2016 11:53 | N | 2000s | 4 |
| 85 | 03/04/2016 03:57 | 350 | small | 1998.0 | manual | 0.0 | corsa | 150000 | 2 | petrol | opel | NaN | 2016-03-04 | 0 | 82110 | 03/04/2016 08:53 | N | 1990s | 8 |
| 122 | 01/04/2016 16:06 | 800 | sedan | 1993.0 | manual | 0.0 | golf | 10000 | 9 | petrol | volkswagen | yes | 2016-01-04 | 0 | 65929 | 07/04/2016 11:17 | N | 1990s | 6 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339969 | 01/04/2016 12:38 | 1299 | coupe | 2013.0 | auto | 0.0 | NaN | 5000 | 0 | petrol | sonstige_autos | no | 2016-01-04 | 0 | 48703 | 01/04/2016 12:38 | N | 2010_plus | 4 |
| 339973 | 05/04/2016 02:36 | 1500 | coupe | 2013.0 | manual | 0.0 | NaN | 5000 | 11 | petrol | sonstige_autos | no | 2016-05-04 | 0 | 27474 | 05/04/2016 08:46 | N | 2010_plus | 2 |
| 339974 | 09/03/2016 12:58 | 700 | coupe | 2013.0 | manual | 0.0 | NaN | 100000 | 1 | petrol | sonstige_autos | no | 2016-09-03 | 0 | 51570 | 05/04/2016 01:16 | N | 2010_plus | 5 |
| 339976 | 20/03/2016 23:49 | 3000 | coupe | 2013.0 | manual | 0.0 | NaN | 150000 | 0 | gasoline | sonstige_autos | NaN | 2016-03-20 | 0 | 85072 | 23/03/2016 11:17 | N | 2010_plus | 8 |
| 339978 | 14/03/2016 19:40 | 5999 | coupe | 2013.0 | manual | 0.0 | NaN | 150000 | 12 | gasoline | sonstige_autos | NaN | 2016-03-14 | 0 | 89081 | 19/03/2016 07:47 | N | 2010_plus | 8 |
10919 rows × 19 columns
def fill_all_missing_values_mp(df, repeat_until_change=True, threshold=0.7, verbose=True, max_iterations=10):
    """
    Fill missing values for power and model using tiered group strategies.
    Optimized version with better memory management and early stopping.
    """
    df = df.copy()

    def safe_mode(series):
        """Return mode if confident enough (>= threshold), else NaN."""
        s = series.dropna()
        if len(s) == 0:
            return np.nan
        counts = s.value_counts(normalize=True)
        if len(counts) == 0:
            return np.nan
        top_val, top_freq = counts.index[0], counts.iloc[0]
        return top_val if top_freq >= threshold else np.nan

    def is_zero_condition(condition):
        """Heuristic to detect if condition checks for zero (lambda x: x == 0)."""
        try:
            test = condition(pd.Series([0, np.nan], dtype=object))
            if isinstance(test, (bool, np.bool_)) and test:
                return True
            if hasattr(test, "__len__") and len(test) >= 1:
                return bool(test.iloc[0] if hasattr(test, 'iloc') else test[0])
        except Exception:
            pass
        return False

    def make_key_tuple(row_vals):
        """Helper: convert list-like row values to a hashable tuple with None for NaN."""
        return tuple(None if (isinstance(x, float) and np.isnan(x)) else x for x in row_vals)

    def fill_column(target_col, fill_strategies, condition=lambda x: x.isna()):
        total_filled = 0
        zero_check = is_zero_condition(condition)
        # Track initial state
        if zero_check:
            initial_missing = (df[target_col] == 0).sum()
        else:
            initial_missing = df[target_col].isna().sum()
        if initial_missing == 0:
            return 0
        if verbose:
            print(f" → Starting with {initial_missing:,} missing values in '{target_col}'")
        for cols in fill_strategies:
            # Check if there's still work to do
            if zero_check:
                current_missing = (df[target_col] == 0).sum()
            else:
                current_missing = df[target_col].isna().sum()
            if current_missing == 0:
                break
            start_time = time.time()
            try:
                # Compute group modes using safe_mode
                group_modes = (
                    df.groupby(cols, dropna=False)[target_col]
                    .apply(safe_mode)
                    .reset_index()
                    .rename(columns={target_col: 'fill_value'})
                )
                # Remove groups with no valid fill value
                group_modes = group_modes[group_modes['fill_value'].notna()]
                if len(group_modes) == 0:
                    continue
            except Exception as e:
                if verbose:
                    print(f"⚠️ Skipping strategy {cols} — groupby failed: {str(e)[:50]}")
                continue
            # Build mapping dict from group_modes
            keys = group_modes[cols].apply(lambda r: make_key_tuple(r.tolist()), axis=1)
            mapping = dict(zip(keys, group_modes['fill_value'].values))
            # Compute fill_value per-row by mapping (keeps original row order)
            row_keys = df[cols].apply(lambda r: make_key_tuple(r.tolist()), axis=1)
            fill_series = row_keys.map(mapping)
            # Create mask of rows that need filling AND have a candidate fill_value
            mask_need = condition(df[target_col])
            mask_candidate = fill_series.notna()
            mask = mask_need & mask_candidate
            # Count before
            if zero_check:
                before_missing = (df[target_col] == 0).sum()
            else:
                before_missing = df[target_col].isna().sum()
            # Perform fill
            if mask.any():
                df.loc[mask, target_col] = fill_series.loc[mask].values
            # Count after
            if zero_check:
                after_missing = (df[target_col] == 0).sum()
            else:
                after_missing = df[target_col].isna().sum()
            filled_now = before_missing - after_missing
            total_filled += int(filled_now)
            if verbose and filled_now > 0:
                elapsed = time.time() - start_time
                print(f"✅ Filled {int(filled_now):,} values in '{target_col}' using {cols} ({after_missing:,} remaining, took {elapsed:.2f}s)")
        return total_filled

    iteration = 0
    while iteration < max_iterations:
        iteration += 1
        total_filled = 0
        if verbose:
            print(f"\n🌀 Iteration {iteration} starting...")
        # --- POWER ---
        power_strategies = [
            ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'vehicletype'],
            ['brand', 'model', 'fueltype', 'vehicletype'],
            ['brand', 'model', 'fueltype', 'year_bin'],
            ['brand', 'model', 'fueltype', 'registrationyear'],
            ['brand', 'model', 'fueltype', 'gearbox'],
            ['brand', 'model', 'year_bin'],
            ['brand', 'model', 'registrationyear'],
            ['brand', 'model'],
            ['brand', 'vehicletype'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'gearbox'],
            ['brand']
        ]
        total_filled += fill_column('power', power_strategies, condition=lambda x: x == 0)
        # --- MODEL ---
        model_strategies = [
            ['brand', 'vehicletype', 'power', 'year_bin'],
            ['brand', 'vehicletype', 'power', 'registrationyear'],
            ['brand', 'vehicletype', 'power', 'gearbox'],
            ['brand', 'vehicletype', 'year_bin'],
            ['brand', 'vehicletype', 'registrationyear'],
            ['brand', 'vehicletype', 'power'],
            ['brand', 'vehicletype', 'gearbox'],
            ['brand', 'vehicletype'],
            ['brand', 'power'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'gearbox'],
            ['brand']
        ]
        total_filled += fill_column('model', model_strategies)
        if verbose:
            print(f"🔁 Iteration {iteration} filled {total_filled:,} total values")
        if not repeat_until_change or total_filled == 0:
            if verbose:
                print("🏁 No further changes detected, stopping.")
            break
    return df
df_app10 = fill_all_missing_values_mp(df_app9, threshold = 0.55)
🌀 Iteration 1 starting...
 → Starting with 10,919 missing values in 'power'
✅ Filled 56 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (10,863 remaining, took 4.43s)
✅ Filled 253 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (10,610 remaining, took 10.56s)
✅ Filled 55 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (10,555 remaining, took 3.89s)
✅ Filled 3 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (10,552 remaining, took 3.47s)
✅ Filled 15 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (10,537 remaining, took 7.86s)
✅ Filled 16 values in 'power' using ['brand', 'model', 'vehicletype'] (10,521 remaining, took 2.83s)
✅ Filled 31 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (10,490 remaining, took 3.44s)
✅ Filled 134 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (10,356 remaining, took 7.12s)
✅ Filled 8 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (10,348 remaining, took 3.16s)
✅ Filled 3 values in 'power' using ['brand', 'model', 'registrationyear'] (10,345 remaining, took 5.16s)
✅ Filled 2 values in 'power' using ['brand', 'model'] (10,343 remaining, took 2.39s)
✅ Filled 2 values in 'power' using ['brand', 'registrationyear'] (10,341 remaining, took 3.43s)
 → Starting with 2,395 missing values in 'model'
✅ Filled 57 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (2,338 remaining, took 11.02s)
✅ Filled 45 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (2,293 remaining, took 22.31s)
✅ Filled 25 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (2,268 remaining, took 9.40s)
✅ Filled 11 values in 'model' using ['brand', 'vehicletype', 'year_bin'] (2,257 remaining, took 2.83s)
✅ Filled 34 values in 'model' using ['brand', 'vehicletype', 'registrationyear'] (2,223 remaining, took 5.75s)
✅ Filled 5 values in 'model' using ['brand', 'vehicletype', 'power'] (2,218 remaining, took 7.88s)
✅ Filled 41 values in 'model' using ['brand', 'vehicletype', 'gearbox'] (2,177 remaining, took 2.72s)
✅ Filled 2 values in 'model' using ['brand', 'registrationyear'] (2,175 remaining, took 3.62s)
✅ Filled 1 values in 'model' using ['brand'] (2,174 remaining, took 1.94s)
🔁 Iteration 1 filled 799 total values

🌀 Iteration 2 starting...
 → Starting with 10,341 missing values in 'power'
✅ Filled 3 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (10,338 remaining, took 4.46s)
✅ Filled 21 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (10,317 remaining, took 3.44s)
✅ Filled 4 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (10,313 remaining, took 7.82s)
✅ Filled 257 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (10,056 remaining, took 3.15s)
✅ Filled 187 values in 'power' using ['brand', 'model', 'vehicletype'] (9,869 remaining, took 2.70s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (9,868 remaining, took 3.32s)
✅ Filled 1,445 values in 'power' using ['brand', 'model', 'year_bin'] (8,423 remaining, took 2.68s)
✅ Filled 2 values in 'power' using ['brand', 'model', 'registrationyear'] (8,421 remaining, took 5.15s)
✅ Filled 293 values in 'power' using ['brand', 'model'] (8,128 remaining, took 2.43s)
 → Starting with 2,174 missing values in 'model'
✅ Filled 4 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (2,170 remaining, took 10.96s)
✅ Filled 2 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (2,168 remaining, took 22.17s)
🔁 Iteration 2 filled 2,219 total values

🌀 Iteration 3 starting...
 → Starting with 8,128 missing values in 'power'
✅ Filled 63 values in 'power' using ['brand', 'model', 'vehicletype', 'year_bin'] (8,065 remaining, took 3.46s)
✅ Filled 3 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (8,062 remaining, took 3.35s)
✅ Filled 40 values in 'power' using ['brand', 'model', 'vehicletype'] (8,022 remaining, took 2.84s)
 → Starting with 2,168 missing values in 'model'
✅ Filled 2 values in 'model' using ['brand', 'vehicletype', 'power', 'year_bin'] (2,166 remaining, took 11.15s)
✅ Filled 1 values in 'model' using ['brand', 'vehicletype', 'power', 'registrationyear'] (2,165 remaining, took 22.29s)
✅ Filled 3 values in 'model' using ['brand', 'vehicletype', 'power', 'gearbox'] (2,162 remaining, took 9.41s)
🔁 Iteration 3 filled 112 total values

🌀 Iteration 4 starting...
 → Starting with 8,022 missing values in 'power'
 → Starting with 2,162 missing values in 'model'
🔁 Iteration 4 filled 0 total values
🏁 No further changes detected, stopping.
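Most of the runtime reported above is spent building per-row key tuples. As a rough illustration of the same "confident mode" idea, the sketch below fills zeros for a single grouping level using `groupby(...).transform`. The names `confident_mode` and `fill_zero_by_group` are hypothetical and not part of the notebook, and the sketch makes one deliberate assumption that differs from the function above: zeros are treated as missing before the vote, so they cannot become the group mode.

```python
import numpy as np
import pandas as pd


def confident_mode(series, threshold=0.55):
    """Return the most frequent value if its share is >= threshold, else NaN."""
    counts = series.dropna().value_counts(normalize=True)
    if counts.empty:
        return np.nan
    return counts.index[0] if counts.iloc[0] >= threshold else np.nan


def fill_zero_by_group(df, target, group_cols, threshold=0.55):
    """Replace zeros in `target` with the confident group mode (hypothetical helper)."""
    out = df.copy()
    # Assumption: zeros count as missing, so they are excluded from the vote.
    known = out[target].replace(0, np.nan)
    # Broadcast each group's confident mode back to every row of that group.
    candidate = known.groupby([out[c] for c in group_cols]).transform(
        lambda s: confident_mode(s, threshold)
    )
    # Only touch rows that are zero AND have a confident candidate.
    mask = out[target].eq(0) & candidate.notna()
    out.loc[mask, target] = candidate[mask]
    return out


toy = pd.DataFrame({
    "brand": ["opel"] * 4 + ["audi"] * 3,
    "model": ["corsa"] * 4 + ["a4"] * 3,
    "power": [60.0, 60.0, 75.0, 0.0, 0.0, 130.0, 150.0],
})
filled = fill_zero_by_group(toy, "power", ["brand", "model"])
print(filled)
```

In the toy frame, the Corsa group has a clear majority (two of three known values are 60), so its zero is filled; the A4 group is split evenly, so its zero is left alone. A tiered version would call the helper once per grouping, from most to least specific.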
gc.collect()
0
df_app11 = fill_zero_power(df_app10,group_cols = ['brand','model','vehicletype','fueltype','year_bin','pc_bin'], threshold = 0.55)
display(df_app11[df_app11['power'] == 0])
df_app11 = fill_zero_power(df_app11,group_cols = ['brand','model','vehicletype','fueltype','registrationyear','pc_bin'], threshold = 0.55)
df_app11[df_app11['power'] == 0]
df_app11 = fill_zero_power(df_app11,group_cols = ['brand','model','vehicletype','fueltype','gearbox','pc_bin'], threshold = 0.55)
df_app11[df_app11['power'] == 0]
df_app11 = fill_zero_power(df_app11,group_cols = ['brand','model','vehicletype','fueltype','pc_bin'], threshold = 0.55)
df_app11[df_app11['power'] == 0]
df_app11 = fill_zero_power(df_app11,group_cols = ['brand','model','vehicletype','pc_bin'], threshold = 0.55)
df_app11[df_app11['power'] == 0]
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin | pc_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 52 | 01/04/2016 11:56 | 1200 | coupe | 2001.0 | manual | 0.0 | astra | 150000 | 0 | petrol | opel | NaN | 2016-01-04 | 0 | 47249 | 07/04/2016 08:46 | N | 2000s | 4 |
| 68 | 23/03/2016 11:53 | 2400 | sedan | 2003.0 | manual | 0.0 | a4 | 150000 | 9 | gasoline | audi | NaN | 2016-03-23 | 0 | 40210 | 23/03/2016 11:53 | N | 2000s | 4 |
| 85 | 03/04/2016 03:57 | 350 | small | 1998.0 | manual | 0.0 | corsa | 150000 | 2 | petrol | opel | NaN | 2016-03-04 | 0 | 82110 | 03/04/2016 08:53 | N | 1990s | 8 |
| 122 | 01/04/2016 16:06 | 800 | sedan | 1993.0 | manual | 0.0 | golf | 10000 | 9 | petrol | volkswagen | yes | 2016-01-04 | 0 | 65929 | 07/04/2016 11:17 | N | 1990s | 6 |
| 141 | 12/03/2016 17:47 | 2999 | wagon | 2001.0 | manual | 0.0 | 3er | 150000 | 7 | petrol | bmw | NaN | 2016-12-03 | 0 | 45891 | 07/04/2016 09:17 | N | 2000s | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339969 | 01/04/2016 12:38 | 1299 | coupe | 2013.0 | auto | 0.0 | NaN | 5000 | 0 | petrol | sonstige_autos | no | 2016-01-04 | 0 | 48703 | 01/04/2016 12:38 | N | 2010_plus | 4 |
| 339973 | 05/04/2016 02:36 | 1500 | coupe | 2013.0 | manual | 0.0 | NaN | 5000 | 11 | petrol | sonstige_autos | no | 2016-05-04 | 0 | 27474 | 05/04/2016 08:46 | N | 2010_plus | 2 |
| 339974 | 09/03/2016 12:58 | 700 | coupe | 2013.0 | manual | 0.0 | NaN | 100000 | 1 | petrol | sonstige_autos | no | 2016-09-03 | 0 | 51570 | 05/04/2016 01:16 | N | 2010_plus | 5 |
| 339976 | 20/03/2016 23:49 | 3000 | coupe | 2013.0 | manual | 0.0 | NaN | 150000 | 0 | gasoline | sonstige_autos | NaN | 2016-03-20 | 0 | 85072 | 23/03/2016 11:17 | N | 2010_plus | 8 |
| 339978 | 14/03/2016 19:40 | 5999 | coupe | 2013.0 | manual | 0.0 | NaN | 150000 | 12 | gasoline | sonstige_autos | NaN | 2016-03-14 | 0 | 89081 | 19/03/2016 07:47 | N | 2010_plus | 8 |
7949 rows × 19 columns
| | datecrawled | price | vehicletype | registrationyear | gearbox | power | model | mileage | registrationmonth | fueltype | brand | notrepaired | datecreated | numberofpictures | postalcode | lastseen | registration_correction | year_bin | pc_bin |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 52 | 01/04/2016 11:56 | 1200 | coupe | 2001.0 | manual | 0.0 | astra | 150000 | 0 | petrol | opel | NaN | 2016-01-04 | 0 | 47249 | 07/04/2016 08:46 | N | 2000s | 4 |
| 68 | 23/03/2016 11:53 | 2400 | sedan | 2003.0 | manual | 0.0 | a4 | 150000 | 9 | gasoline | audi | NaN | 2016-03-23 | 0 | 40210 | 23/03/2016 11:53 | N | 2000s | 4 |
| 85 | 03/04/2016 03:57 | 350 | small | 1998.0 | manual | 0.0 | corsa | 150000 | 2 | petrol | opel | NaN | 2016-03-04 | 0 | 82110 | 03/04/2016 08:53 | N | 1990s | 8 |
| 122 | 01/04/2016 16:06 | 800 | sedan | 1993.0 | manual | 0.0 | golf | 10000 | 9 | petrol | volkswagen | yes | 2016-01-04 | 0 | 65929 | 07/04/2016 11:17 | N | 1990s | 6 |
| 141 | 12/03/2016 17:47 | 2999 | wagon | 2001.0 | manual | 0.0 | 3er | 150000 | 7 | petrol | bmw | NaN | 2016-12-03 | 0 | 45891 | 07/04/2016 09:17 | N | 2000s | 4 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 339969 | 01/04/2016 12:38 | 1299 | coupe | 2013.0 | auto | 0.0 | NaN | 5000 | 0 | petrol | sonstige_autos | no | 2016-01-04 | 0 | 48703 | 01/04/2016 12:38 | N | 2010_plus | 4 |
| 339973 | 05/04/2016 02:36 | 1500 | coupe | 2013.0 | manual | 0.0 | NaN | 5000 | 11 | petrol | sonstige_autos | no | 2016-05-04 | 0 | 27474 | 05/04/2016 08:46 | N | 2010_plus | 2 |
| 339974 | 09/03/2016 12:58 | 700 | coupe | 2013.0 | manual | 0.0 | NaN | 100000 | 1 | petrol | sonstige_autos | no | 2016-09-03 | 0 | 51570 | 05/04/2016 01:16 | N | 2010_plus | 5 |
| 339976 | 20/03/2016 23:49 | 3000 | coupe | 2013.0 | manual | 0.0 | NaN | 150000 | 0 | gasoline | sonstige_autos | NaN | 2016-03-20 | 0 | 85072 | 23/03/2016 11:17 | N | 2010_plus | 8 |
| 339978 | 14/03/2016 19:40 | 5999 | coupe | 2013.0 | manual | 0.0 | NaN | 150000 | 12 | gasoline | sonstige_autos | NaN | 2016-03-14 | 0 | 89081 | 19/03/2016 07:47 | N | 2010_plus | 8 |
7929 rows × 19 columns
df_app12 = df_app11[df_app11['model'].notna()]
def fill_missing_power(df, repeat_until_change=True, threshold=0.7, verbose=True, max_iterations=10):
    """
    Fill missing power values (where power == 0) using tiered group strategies.
    Optimized version with better memory management and early stopping.
    """
    df = df.copy()

    def safe_mode(series):
        """Return mode if confident enough (>= threshold), else NaN."""
        s = series.dropna()
        if len(s) == 0:
            return np.nan
        counts = s.value_counts(normalize=True)
        if len(counts) == 0:
            return np.nan
        top_val, top_freq = counts.index[0], counts.iloc[0]
        return top_val if top_freq >= threshold else np.nan

    def is_zero_condition(condition):
        """Heuristic to detect if condition checks for zero (lambda x: x == 0)."""
        try:
            test = condition(pd.Series([0, np.nan], dtype=object))
            if isinstance(test, (bool, np.bool_)) and test:
                return True
            if hasattr(test, "__len__") and len(test) >= 1:
                return bool(test.iloc[0] if hasattr(test, 'iloc') else test[0])
        except Exception:
            pass
        return False

    def make_key_tuple(row_vals):
        """Helper: convert list-like row values to a hashable tuple with None for NaN."""
        return tuple(None if (isinstance(x, float) and np.isnan(x)) else x for x in row_vals)

    def fill_column(target_col, fill_strategies, condition=lambda x: x.isna()):
        total_filled = 0
        zero_check = is_zero_condition(condition)
        # Track initial state
        if zero_check:
            initial_missing = (df[target_col] == 0).sum()
        else:
            initial_missing = df[target_col].isna().sum()
        if initial_missing == 0:
            return 0
        if verbose:
            print(f" → Starting with {initial_missing:,} missing values in '{target_col}'")
        for cols in fill_strategies:
            # Check if there's still work to do
            if zero_check:
                current_missing = (df[target_col] == 0).sum()
            else:
                current_missing = df[target_col].isna().sum()
            if current_missing == 0:
                break
            start_time = time.time()
            try:
                # Compute group modes using safe_mode
                group_modes = (
                    df.groupby(cols, dropna=False)[target_col]
                    .apply(safe_mode)
                    .reset_index()
                    .rename(columns={target_col: 'fill_value'})
                )
                # Remove groups with no valid fill value
                group_modes = group_modes[group_modes['fill_value'].notna()]
                if len(group_modes) == 0:
                    continue
            except Exception as e:
                if verbose:
                    print(f"⚠️ Skipping strategy {cols} — groupby failed: {str(e)[:50]}")
                continue
            # Build mapping dict from group_modes
            keys = group_modes[cols].apply(lambda r: make_key_tuple(r.tolist()), axis=1)
            mapping = dict(zip(keys, group_modes['fill_value'].values))
            # Compute fill_value per-row by mapping (keeps original row order)
            row_keys = df[cols].apply(lambda r: make_key_tuple(r.tolist()), axis=1)
            fill_series = row_keys.map(mapping)
            # Create mask of rows that need filling AND have a candidate fill_value
            mask_need = condition(df[target_col])
            mask_candidate = fill_series.notna()
            mask = mask_need & mask_candidate
            # Count before
            if zero_check:
                before_missing = (df[target_col] == 0).sum()
            else:
                before_missing = df[target_col].isna().sum()
            # Perform fill
            if mask.any():
                df.loc[mask, target_col] = fill_series.loc[mask].values
            # Count after
            if zero_check:
                after_missing = (df[target_col] == 0).sum()
            else:
                after_missing = df[target_col].isna().sum()
            filled_now = before_missing - after_missing
            total_filled += int(filled_now)
            if verbose and filled_now > 0:
                elapsed = time.time() - start_time
                print(f"✅ Filled {int(filled_now):,} values in '{target_col}' using {cols} ({after_missing:,} remaining, took {elapsed:.2f}s)")
        return total_filled

    iteration = 0
    while iteration < max_iterations:
        iteration += 1
        total_filled = 0
        if verbose:
            print(f"\n🌀 Iteration {iteration} starting...")
        # --- POWER ONLY ---
        power_strategies = [
            ['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'fueltype', 'pc_bin'],
            ['brand', 'model', 'vehicletype', 'pc_bin'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'vehicletype', 'year_bin'],
            ['brand', 'model', 'vehicletype', 'registrationyear'],
            ['brand', 'model', 'vehicletype', 'gearbox'],
            ['brand', 'model', 'vehicletype'],
            ['brand', 'model', 'fueltype', 'vehicletype'],
            ['brand', 'model', 'fueltype', 'year_bin'],
            ['brand', 'model', 'fueltype', 'registrationyear'],
            ['brand', 'model', 'fueltype', 'gearbox'],
            ['brand', 'model', 'year_bin'],
            ['brand', 'model', 'registrationyear'],
            ['brand', 'model', 'pc_bin'],
            ['brand', 'model'],
            ['brand', 'vehicletype'],
            ['brand', 'year_bin'],
            ['brand', 'registrationyear'],
            ['brand', 'gearbox'],
            ['brand', 'pc_bin'],
            ['brand']
        ]
        total_filled += fill_column('power', power_strategies, condition=lambda x: x == 0)
        if verbose:
            print(f"🔁 Iteration {iteration} filled {total_filled:,} total values")
        if not repeat_until_change or total_filled == 0:
            if verbose:
                print("🏁 No further changes detected, stopping.")
            break
    return df
df_app13 = fill_missing_power(df_app12, threshold = 0.51)
🌀 Iteration 1 starting...
 → Starting with 7,635 missing values in 'power'
✅ Filled 10 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'year_bin', 'pc_bin'] (7,625 remaining, took 11.12s)
✅ Filled 25 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'registrationyear', 'pc_bin'] (7,600 remaining, took 29.80s)
✅ Filled 5 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'gearbox', 'pc_bin'] (7,595 remaining, took 9.28s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'vehicletype', 'fueltype', 'pc_bin'] (7,594 remaining, took 7.19s)
✅ Filled 2 values in 'power' using ['brand', 'model', 'vehicletype', 'pc_bin'] (7,592 remaining, took 4.97s)
✅ Filled 48 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'year_bin'] (7,544 remaining, took 4.30s)
✅ Filled 178 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'registrationyear'] (7,366 remaining, took 10.23s)
✅ Filled 7 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype', 'gearbox'] (7,359 remaining, took 3.96s)
✅ Filled 13 values in 'power' using ['brand', 'model', 'vehicletype', 'registrationyear'] (7,346 remaining, took 7.59s)
✅ Filled 4 values in 'power' using ['brand', 'model', 'vehicletype', 'gearbox'] (7,342 remaining, took 3.17s)
✅ Filled 5 values in 'power' using ['brand', 'model', 'fueltype', 'vehicletype'] (7,337 remaining, took 3.34s)
✅ Filled 37 values in 'power' using ['brand', 'model', 'fueltype', 'year_bin'] (7,300 remaining, took 3.32s)
✅ Filled 60 values in 'power' using ['brand', 'model', 'fueltype', 'registrationyear'] (7,240 remaining, took 6.95s)
✅ Filled 1 values in 'power' using ['brand', 'model', 'fueltype', 'gearbox'] (7,239 remaining, took 3.12s)
✅ Filled 2 values in 'power' using ['brand', 'model', 'registrationyear'] (7,237 remaining, took 5.11s)
✅ Filled 13 values in 'power' using ['brand', 'model', 'pc_bin'] (7,224 remaining, took 3.33s)
✅ Filled 1 values in 'power' using ['brand', 'model'] (7,223 remaining, took 2.44s)
✅ Filled 1 values in 'power' using ['brand', 'vehicletype'] (7,222 remaining, took 2.41s)
✅ Filled 5 values in 'power' using ['brand', 'year_bin'] (7,217 remaining, took 2.39s)
🔁 Iteration 1 filled 418 total values

🌀 Iteration 2 starting...
 → Starting with 7,217 missing values in 'power'
🔁 Iteration 2 filled 0 total values
🏁 No further changes detected, stopping.
print(df_app13.memory_usage(deep=True).sum() / 1_000_000, "MB")
258.348599 MB
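A large share of a footprint like this usually comes from repeated strings stored as `object`. The sketch below, on a made-up toy frame, shows how converting low-cardinality text columns to the pandas `category` dtype can shrink `memory_usage(deep=True)`. This is an optional idea rather than something the notebook does; the column values here are illustrative only.

```python
import pandas as pd

# Toy frame with heavily repeated string values (illustrative data only).
toy = pd.DataFrame({
    "brand": ["volkswagen", "opel", "audi", "bmw"] * 2500,
    "gearbox": ["manual", "auto"] * 5000,
})

before = toy.memory_usage(deep=True).sum()

# Each distinct string is stored once; rows hold small integer codes instead.
compact = toy.astype({"brand": "category", "gearbox": "category"})
after = compact.memory_usage(deep=True).sum()

print(f"{before / 1_000_000:.3f} MB -> {after / 1_000_000:.3f} MB")
```

Whether this is worthwhile depends on the later steps: some encoders and models accept categoricals directly, while others expect plain strings or numbers.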
gc.collect()
0
del df_ap4
del df_app4
del df_app5
del df_app6
del df_app7
del df_app8
del df_app9
del df_app10
del df_app11
del df_app12
df_app14 = df_app13[df_app13['power'] != 0]
df_app15 = df_app14.drop(columns = ['year_bin','pc_bin', 'registration_correction'])
df_app15['notrepaired'] = df_app15['notrepaired'].fillna('unknown')
df_app15.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 330602 entries, 0 to 339980
Data columns (total 16 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   datecrawled        330602 non-null  object
 1   price              330602 non-null  int64
 2   vehicletype        330602 non-null  object
 3   registrationyear   330602 non-null  float64
 4   gearbox            330602 non-null  object
 5   power              330602 non-null  float64
 6   model              330602 non-null  object
 7   mileage            330602 non-null  int64
 8   registrationmonth  330602 non-null  int64
 9   fueltype           330600 non-null  object
 10  brand              330602 non-null  object
 11  notrepaired        330602 non-null  object
 12  datecreated        330602 non-null  datetime64[ns]
 13  numberofpictures   330602 non-null  int64
 14  postalcode         330602 non-null  int64
 15  lastseen           330602 non-null  object
dtypes: datetime64[ns](1), float64(2), int64(5), object(8)
memory usage: 42.9+ MB
df_app15.to_pickle('checkpoint_02.pkl')
petrol = (df_app15['fueltype'] == 'gasoline')
df_app15.loc[petrol,['fueltype']] = 'petrol'
petrol = (df_app15['fueltype'].isna())
df_app15.loc[petrol,['fueltype']] = 'petrol'
del petrol
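The two mask-based assignments above merge `gasoline` into `petrol` and treat a missing fuel type as `petrol`. A compact way to express the same mapping, shown here on a small made-up Series rather than on `df_app15`, is to chain `replace` and `fillna`:

```python
import pandas as pd

# Illustrative values only; the mapping mirrors the notebook's choice
# (gasoline -> petrol, missing -> petrol).
fuel = pd.Series(["gasoline", "petrol", None, "lpg"])

clean = fuel.replace({"gasoline": "petrol"}).fillna("petrol")
print(clean.tolist())  # ['petrol', 'petrol', 'petrol', 'lpg']
```

Applied to the real column this would read `df_app15['fueltype'] = df_app15['fueltype'].replace(...).fillna(...)`, which avoids the temporary mask variable.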
DType Clean Up¶
# 1. Fix datetime columns
date_cols = ['datecrawled', 'lastseen']
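for col in date_cols:
    # Source timestamps are day-first (e.g. "24/03/2016 11:52"), so say so explicitly;
    # otherwise ambiguous dates such as 01/04/2016 are read as January 4th.
    df_app15[col] = pd.to_datetime(df_app15[col], dayfirst=True, errors='coerce')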
# 2. Convert numeric columns to efficient types
# registrationyear & power should not be floats
df_app15['registrationyear'] = df_app15['registrationyear'].astype('int')
df_app15['power'] = df_app15['power'].astype('int')
# 3. Clean up memory
gc.collect()
print("Final memory usage:", df_app15.memory_usage(deep=True).sum() / 1_000_000, "MB")
print(df_app15.dtypes)
Final memory usage: 152.521751 MB
datecrawled          datetime64[ns]
price                         int64
vehicletype                  object
registrationyear              int64
gearbox                      object
power                         int64
model                        object
mileage                       int64
registrationmonth            object
fueltype                     object
brand                        object
notrepaired                  object
datecreated          datetime64[ns]
numberofpictures              int64
postalcode                    int64
lastseen             datetime64[ns]
dtype: object
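The raw timestamps in this dataset are written day-first (for example `24/03/2016 11:52`). The short sketch below, using two sample strings in that layout, shows why an explicit format matters: by default pandas reads an ambiguous date such as `01/04/2016` month-first. The format string is an assumption based on the sample rows shown in the tables.

```python
import pandas as pd

raw = pd.Series(["24/03/2016 11:52", "01/04/2016 11:56"])

# Explicit day-first format: both rows are parsed consistently.
parsed = pd.to_datetime(raw, format="%d/%m/%Y %H:%M", errors="coerce")
print(parsed.dt.month.tolist())  # [3, 4]

# Default parsing of an ambiguous string is month-first: January 4th.
naive = pd.to_datetime("01/04/2016 11:56")
print(naive.month)  # 1
```

Any month or year feature derived from a month-first parse of these columns would be wrong for days 1 to 12 of each month.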
df_app15['datecrawled_year'] = df_app15['datecrawled'].dt.year
df_app15['datecrawled_month'] = df_app15['datecrawled'].dt.month.astype('object')
df_app15['datecreated_year'] = df_app15['datecreated'].dt.year
df_app15['datecreated_month'] = df_app15['datecreated'].dt.month.astype('object')
df_app15['lastseen_year'] = df_app15['lastseen'].dt.year
df_app15['lastseen_month'] = df_app15['lastseen'].dt.month.astype('object')
df_app15['postalcode'] = df_app15['postalcode'].astype('object')
df_app15['registrationmonth'] = df_app15['registrationmonth'].astype('object')
df_app15.insert(df_app15.columns.get_loc("datecrawled"), "datecrawled_month", df_app15.pop("datecrawled_month"))
df_app15.insert(df_app15.columns.get_loc("datecrawled") + 1, "datecrawled_year", df_app15.pop("datecrawled_year"))
df_app15.insert(df_app15.columns.get_loc("datecreated"), "datecreated_month", df_app15.pop("datecreated_month"))
df_app15.insert(df_app15.columns.get_loc("datecreated") + 1, "datecreated_year", df_app15.pop("datecreated_year"))
df_app15.insert(df_app15.columns.get_loc("lastseen"), "lastseen_month", df_app15.pop("lastseen_month"))
df_app15.insert(df_app15.columns.get_loc("lastseen") + 1, "lastseen_year", df_app15.pop("lastseen_year"))
df_app15 = df_app15.drop(columns=['datecrawled', 'datecreated', 'lastseen'])
df_app15.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 330602 entries, 0 to 339980
Data columns (total 19 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   datecrawled_month  330602 non-null  object
 1   datecrawled_year   330602 non-null  int64
 2   price              330602 non-null  int64
 3   vehicletype        330602 non-null  object
 4   registrationyear   330602 non-null  int64
 5   gearbox            330602 non-null  object
 6   power              330602 non-null  int64
 7   model              330602 non-null  object
 8   mileage            330602 non-null  int64
 9   registrationmonth  330602 non-null  object
 10  fueltype           330602 non-null  object
 11  brand              330602 non-null  object
 12  notrepaired        330602 non-null  object
 13  datecreated_month  330602 non-null  object
 14  datecreated_year   330602 non-null  int64
 15  numberofpictures   330602 non-null  int64
 16  postalcode         330602 non-null  object
 17  lastseen_month     330602 non-null  object
 18  lastseen_year      330602 non-null  int64
dtypes: int64(8), object(11)
memory usage: 50.4+ MB
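The month/year extraction above repeats the same four steps for each of the three date columns. As a sketch of how that could be folded into one loop, the hypothetical helper `split_date_parts` below replaces each datetime column, in place within the column order, by a `<col>_month` and `<col>_year` pair. The toy frame is illustrative only.

```python
import pandas as pd


def split_date_parts(df, cols):
    """Replace each datetime column by <col>_month and <col>_year (hypothetical helper)."""
    out = df.copy()
    for col in cols:
        pos = out.columns.get_loc(col)   # remember where the column sits
        stamps = out.pop(col)            # remove the original datetime column
        out.insert(pos, f"{col}_month", stamps.dt.month.astype("object"))
        out.insert(pos + 1, f"{col}_year", stamps.dt.year)
    return out


toy = pd.DataFrame({
    "datecrawled": pd.to_datetime(["2016-03-24 11:52"]),
    "price": [480],
    "lastseen": pd.to_datetime(["2016-04-07 03:16"]),
})
parts = split_date_parts(toy, ["datecrawled", "lastseen"])
print(list(parts.columns))
```

The month is cast to `object` to mirror the notebook, where months are treated as categorical rather than numeric.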
DataFrame Comparison¶
gc.collect()
43
coupe = df[df['vehicletype'] == 'coupe']
suv = df[df['vehicletype'] == 'suv']
small = df[df['vehicletype'] == 'small']
sedan = df[df['vehicletype'] == 'sedan']
convertible = df[df['vehicletype'] == 'convertible']
bus = df[df['vehicletype'] == 'bus']
wagon = df[df['vehicletype'] == 'wagon']
ncoupe = df_app15[df_app15['vehicletype'] == 'coupe']
nsuv = df_app15[df_app15['vehicletype'] == 'suv']
nsmall = df_app15[df_app15['vehicletype'] == 'small']
nsedan = df_app15[df_app15['vehicletype'] == 'sedan']
nconvertible = df_app15[df_app15['vehicletype'] == 'convertible']
nbus = df_app15[df_app15['vehicletype'] == 'bus']
nwagon = df_app15[df_app15['vehicletype'] == 'wagon']
coupe['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Coupes per Brand: Before Data Cleaning')
plt.show()
ncoupe['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Coupes per Brand: After Data Cleaning')
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=coupe, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Coupe Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=ncoupe, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Coupe Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
del coupe
del ncoupe
suv['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of SUVs per Brand: Before Data Cleaning')
plt.show()
nsuv['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of SUVs per Brand: After Data Cleaning')
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=suv, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of SUV Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=nsuv, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of SUV Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
del suv
del nsuv
small['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Small Cars per Brand: Before Data Cleaning')
plt.show()
nsmall['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Small Cars per Brand: After Data Cleaning')
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=small, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Small Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=nsmall, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Small Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
del small
del nsmall
sedan['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Sedans per Brand: Before Data Cleaning')
plt.show()
nsedan['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Sedans per Brand: After Data Cleaning')
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=sedan, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Sedan Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=nsedan, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Sedan Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
del sedan
del nsedan
convertible['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Convertibles per Brand: Before Data Cleaning')
plt.show()
nconvertible['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Convertibles per Brand: After Data Cleaning')
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=convertible, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Convertible Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=nconvertible, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Convertible Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
del convertible
del nconvertible
bus['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Buses per Brand: Before Data Cleaning')
plt.show()
nbus['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Buses per Brand: After Data Cleaning')
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=bus, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Bus Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=nbus, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Bus Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
del bus
del nbus
wagon['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Wagons per Brand: Before Data Cleaning')
plt.show()
nwagon['brand'].value_counts().plot(kind='bar', figsize=(10,5), title='Number of Wagons per Brand: After Data Cleaning')
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=wagon, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Wagon Models per Brand: Before Data Cleaning')
plt.grid()
plt.show()
plt.figure(figsize=(14,16))
sns.boxplot(data=nwagon, x='price', y='brand')
plt.xticks(rotation=90)
plt.title('Price Distribution of Wagon Models per Brand: After Data Cleaning')
plt.grid()
plt.show()
df['price'].hist(bins=20)
plt.show()
df_app15['price'].hist(bins=20)
plt.show()
del df_app2
del df_app13
del df_app14
del df_car
del df_app3
del df_vetype
del df_reg
del df_model_x
del df_vt
del df_model
del df_app
del df_ft
del wagon
del nwagon
del vt_power
del remainder_models
del ft
del bora_to_jetta
del jetta16
del captiva
del matiz68
del matiz52
del matiz67
del re_1
del passat
del passat1
del passat2
del passat3
del passat4
del passat5
del passat6
del golf
del passat140
del golf90
del passat90
del golf75
del golf7502
del passat105
del passat131
del passat116
del passat150
del passat115
del passat170
del golf110
del golf60
del polo60
del passat125
del passat100
del passat174
del passat130
del passat120
del audi75
del bmw75
del opelsedan60
del opel9160
del opelastra
del opelcorsa
del astraopel
del opelcombo
gc.collect()
66895
del civic75
del mini75
del nissan60
del seat60
Model Training¶
import sys
def show_memory_usage():
    """Print the ten largest global objects that exceed 1 MB."""
    vars_list = []
    # Iterate over a snapshot so the dict cannot change size mid-loop
    for name, obj in list(globals().items()):
        if not name.startswith('_'):
            size_mb = sys.getsizeof(obj) / (1024**2)
            if size_mb > 1:  # Only show objects > 1MB
                vars_list.append((name, size_mb, type(obj).__name__))
    vars_list.sort(key=lambda x: x[1], reverse=True)
    print("\n🔍 Memory Usage:")
    for name, size, dtype in vars_list[:10]:
        print(f"  {name}: {size:.2f} MB ({dtype})")

# Use it throughout your notebook
show_memory_usage()
🔍 Memory Usage:
  df: 222.44 MB (DataFrame)
  df1: 213.99 MB (DataFrame)
  df_newest: 211.90 MB (DataFrame)
  df_newer: 211.77 MB (DataFrame)
  df_new: 211.75 MB (DataFrame)
  df_app15: 197.05 MB (DataFrame)
  civic75: 11.10 MB (Series)
  mini75: 11.10 MB (Series)
  nissan60: 11.10 MB (Series)
  seat60: 11.10 MB (Series)
data = df_app15.copy()
del df_app15
gc.collect()
0
# If the kernel crashes:
#   1. Re-run the library imports at the top of the notebook (Ctrl+F "libraries" to jump there).
#   2. Run `data = pd.read_pickle('checkpoint_03.pkl')` in a new cell to restore the `data` DataFrame.
# The checkpoint for `data` is written here:
data.to_pickle('checkpoint_03.pkl')
data = pd.read_pickle('checkpoint_03.pkl')
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 330602 entries, 0 to 339980
Data columns (total 19 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   datecrawled_month  330602 non-null  object
 1   datecrawled_year   330602 non-null  int64
 2   price              330602 non-null  int64
 3   vehicletype        330602 non-null  object
 4   registrationyear   330602 non-null  int64
 5   gearbox            330602 non-null  object
 6   power              330602 non-null  int64
 7   model              330602 non-null  object
 8   mileage            330602 non-null  int64
 9   registrationmonth  330602 non-null  object
 10  fueltype           330602 non-null  object
 11  brand              330602 non-null  object
 12  notrepaired        330602 non-null  object
 13  datecreated_month  330602 non-null  object
 14  datecreated_year   330602 non-null  int64
 15  numberofpictures   330602 non-null  int64
 16  postalcode         330602 non-null  object
 17  lastseen_month     330602 non-null  object
 18  lastseen_year      330602 non-null  int64
dtypes: int64(8), object(11)
memory usage: 50.4+ MB
Train/Validate Split¶
features = data.drop('price', axis=1)
target = data['price']
features_train, features_valid, target_train, target_valid = train_test_split(
features, target,
test_size=0.25,
random_state=12345
)
# Identify categorical columns
cat_cols = features_train.select_dtypes(include=['object','category']).columns
num_cols = features_train.select_dtypes(exclude=['object','category']).columns
features_train = features_train.copy()
features_valid = features_valid.copy()
features_train.loc[:, cat_cols] = features_train[cat_cols].astype(str)
features_valid.loc[:, cat_cols] = features_valid[cat_cols].astype(str)
def evaluate_model(name, model, features_train, target_train, features_valid, target_valid, cat_features=None):
    """Fit `model`, time training and prediction, and return RMSE plus timings."""
    print(f"\nTraining {name}...")

    # Time the fit; CatBoost receives the categorical column indices at fit time
    start_train = time.time()
    if cat_features is not None:
        model.fit(features_train, target_train, cat_features=cat_features)
    else:
        model.fit(features_train, target_train)
    train_time = time.time() - start_train

    # Time prediction on the validation set
    start_pred = time.time()
    preds = model.predict(features_valid)
    pred_time = time.time() - start_pred

    # RMSE via np.sqrt so it also runs on scikit-learn versions without the `squared` argument
    rmse = np.sqrt(mean_squared_error(target_valid, preds))
    print(f"{name}: RMSE={rmse:.3f}, TrainTime={train_time:.2f}s, PredTime={pred_time:.4f}s")
    return {
        'Model': name,
        'RMSE': rmse,
        'Train_Time': train_time,
        'Predict_Time': pred_time
    }
gc.collect()
0
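The timing pattern inside `evaluate_model` (record a start time, run the step, subtract) can also be factored into a small reusable helper. The following is a minimal sketch, not part of the original notebook: the `timed` context manager and the `timings` dictionary are hypothetical names, the two workloads are stand-ins for `model.fit(...)` and `model.predict(...)`, and `time.perf_counter` is swapped in for `time.time` because it is a monotonic clock intended for measuring short intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, store):
    """Record the wall-clock duration of the enclosed block in store[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so partial timings are still kept
        store[label] = time.perf_counter() - start


timings = {}

with timed("train", timings):
    total = sum(i * i for i in range(100_000))  # stand-in for model.fit(...)

with timed("predict", timings):
    shifted = [i + 1 for i in range(10_000)]  # stand-in for model.predict(...)

for label, seconds in timings.items():
    print(f"{label}: {seconds:.4f}s")
```

With a helper like this, each model cell would only need to wrap its fit and predict calls, and the collected durations could be copied into the results dictionary.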
ohe_processor = ColumnTransformer(
    transformers=[
        ('cat', OneHotEncoder(handle_unknown='ignore', dtype=int), cat_cols)
    ],
    remainder='passthrough'
)
lr_model = Pipeline([
('ohe', ohe_processor),
('lr', LinearRegression())
])
results = []
results.append(
evaluate_model('Linear Regression Model', lr_model, features_train, target_train, features_valid, target_valid)
)
Training Linear Regression Model... Linear Regression Model: RMSE=2864.931, TrainTime=1.21s, PredTime=0.3178s
gc.collect()
52
# DecisionTree
dt_model = Pipeline([
('ohe', ohe_processor),
('dt', DecisionTreeRegressor(
max_depth=20,
min_samples_leaf=4,
random_state=12345
))
])
results.append(
evaluate_model('Decision Tree Model', dt_model, features_train, target_train, features_valid, target_valid)
)
Training Decision Tree Model... Decision Tree Model: RMSE=1904.375, TrainTime=27.29s, PredTime=0.2165s
gc.collect()
52
# Random Forest
rf_model = Pipeline([
('ohe', ohe_processor),
('rf', RandomForestRegressor(
n_estimators=100,
max_depth=20,
random_state=12345,
n_jobs=-1
))
])
results.append(
evaluate_model('Random Forest', rf_model, features_train, target_train, features_valid, target_valid)
)
Training Random Forest... Random Forest: RMSE=1686.247, TrainTime=1160.16s, PredTime=0.7960s
gc.collect()
80
# results_df = pd.read_pickle('checkpoint_04a.pkl')
results_df = pd.DataFrame(results)
results_df.to_pickle('checkpoint_04a.pkl')
# CATBOOST
cat_features = [features_train.columns.get_loc(c) for c in cat_cols]
cat_model = CatBoostRegressor(
depth=8,
learning_rate=0.1,
iterations=500,
loss_function='RMSE',
verbose=False,
random_seed=12345
)
results.append(
evaluate_model(
'CatBoost',
cat_model,
features_train,
target_train,
features_valid,
target_valid,
cat_features=cat_features
)
)
Training CatBoost... CatBoost: RMSE=1636.975, TrainTime=243.67s, PredTime=0.6921s
cat_cols = list(cat_cols)
for col in cat_cols:
    features_train[col] = features_train[col].astype("category")
    features_valid[col] = features_valid[col].astype("category")
# XGBOOST
xgb_model = Pipeline(steps=[
('preprocess', ohe_processor),
('model', XGBRegressor(
n_estimators=400,
learning_rate=0.05,
max_depth=8,
subsample=0.8,
colsample_bytree=0.8,
random_state=12345,
objective='reg:squarederror',
n_jobs=-1
))
])
results.append(
evaluate_model(
"XGBoost",
xgb_model,
features_train, target_train, features_valid, target_valid)
)
Training XGBoost... XGBoost: RMSE=1655.304, TrainTime=196.66s, PredTime=1.1021s
# LightGBM datasets
lgb_train = lgb.Dataset(
features_train,
label=target_train
)
lgb_valid = lgb.Dataset(
features_valid,
label=target_valid,
reference=lgb_train
)
# LightGBM Set 1
params_set1 = {
'objective': 'regression',
'metric': 'rmse',
'num_leaves': 31,
'learning_rate': 0.05,
'verbose': -1
}
print("\nTraining LightGBM (Set 1)...")
start1 = time.time()
lgb_model1 = lgb.train(
params_set1,
lgb_train,
valid_sets=[lgb_valid],
num_boost_round=300,
callbacks=[lgb.early_stopping(stopping_rounds=50)]
)
train_time1 = time.time() - start1
start_pred1 = time.time()
preds1 = lgb_model1.predict(features_valid)
pred_time1 = time.time() - start_pred1
rmse1 = np.sqrt(mean_squared_error(target_valid, preds1))
results.append({
'Model': 'LightGBM Set 1',
'RMSE': rmse1,
'Boosting_Rounds': lgb_model1.best_iteration,
'Train_Time': train_time1,
'Predict_Time': pred_time1
})
# LightGBM Set 2
params_set2 = {
'objective': 'regression',
'metric': 'rmse',
'num_leaves': 64,
'learning_rate': 0.1,
'verbose': -1
}
print("\nTraining LightGBM (Set 2)...")
start2 = time.time()
lgb_model2 = lgb.train(
params_set2,
lgb_train,
valid_sets=[lgb_valid],
num_boost_round=500,
callbacks=[lgb.early_stopping(stopping_rounds=50)]
)
train_time2 = time.time() - start2
start_pred2 = time.time()
preds2 = lgb_model2.predict(features_valid)
pred_time2 = time.time() - start_pred2
rmse2 = np.sqrt(mean_squared_error(target_valid, preds2))
results.append({
'Model': 'LightGBM Set 2',
'RMSE': rmse2,
'Boosting_Rounds': lgb_model2.best_iteration,
'Train_Time': train_time2,
'Predict_Time': pred_time2
})
print(f"LightGBM Set 1: RMSE={rmse1:.3f}, TrainTime={train_time1:.2f}, PredTime={pred_time1:.2f}")
print(f"LightGBM Set 2: RMSE={rmse2:.3f}, TrainTime={train_time2:.2f}, PredTime={pred_time2:.2f}")
Training LightGBM (Set 1)...
/.venv/lib/python3.9/site-packages/lightgbm/basic.py:1780: UserWarning: Overriding the parameters from Reference Dataset.
  _log_warning('Overriding the parameters from Reference Dataset.')
/.venv/lib/python3.9/site-packages/lightgbm/basic.py:1513: UserWarning: categorical_column in param dict is overridden.
  _log_warning(f'{cat_alias} in param dict is overridden.')
Training until validation scores don't improve for 50 rounds
Did not meet early stopping. Best iteration is:
[300] valid_0's rmse: 1684.32
Training LightGBM (Set 2)...
Training until validation scores don't improve for 50 rounds
Did not meet early stopping. Best iteration is:
[500] valid_0's rmse: 1627.28
LightGBM Set 1: RMSE=1684.316, TrainTime=28.57, PredTime=1.42
LightGBM Set 2: RMSE=1627.283, TrainTime=57.51, PredTime=4.37
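The "Did not meet early stopping" messages in the log mean the validation RMSE was still at its best on the final round, so the patience window of 50 rounds never ran out. The sketch below illustrates that patience rule in plain Python under simplifying assumptions: `best_round` is a hypothetical helper, the two score curves are made up, and this is not LightGBM's own implementation.

```python
def best_round(valid_scores, patience):
    """Return (best_round, stopped_early) for a lower-is-better metric.

    Rounds are 1-indexed. Training stops once `patience` consecutive
    rounds have passed without beating the best score seen so far.
    """
    best_score = float("inf")
    best_idx = 0
    for idx, score in enumerate(valid_scores, start=1):
        if score < best_score:
            best_score, best_idx = score, idx
        elif idx - best_idx >= patience:
            return best_idx, True
    return best_idx, False


# Hypothetical curve that improves every round: early stopping never fires
still_improving = [2000 - i for i in range(1, 301)]
print(best_round(still_improving, patience=50))  # (300, False)

# Hypothetical curve that bottoms out at round 3 and then degrades
plateau = [1700, 1690, 1685, 1686, 1687, 1688, 1689]
print(best_round(plateau, patience=3))  # (3, True)
```

When the best round equals the round limit, as in both LightGBM runs here, raising `num_boost_round` is a natural next experiment, at the cost of longer training and prediction.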
Model analysis¶
# RESULTS TABLE
results_df = pd.DataFrame(results)
results_df.sort_values(by='RMSE', inplace=True)
results_df.reset_index(drop=True, inplace=True)
print("\n\nFINAL MODEL COMPARISON:")
print(results_df.to_string())
FINAL MODEL COMPARISON:
| | Model | RMSE | Train_Time | Predict_Time | Boosting_Rounds |
|---|---|---|---|---|---|
| 0 | LightGBM Set 2 | 1627.283377 | 57.508679 | 4.374746 | 500.0 |
| 1 | CatBoost | 1636.975409 | 243.666996 | 0.692125 | NaN |
| 2 | XGBoost | 1655.304347 | 196.659249 | 1.102127 | NaN |
| 3 | LightGBM Set 1 | 1684.315965 | 28.574041 | 1.416330 | 300.0 |
| 4 | Random Forest | 1686.246985 | 1160.163886 | 0.796043 | NaN |
| 5 | Decision Tree Model | 1904.374673 | 27.287978 | 0.216526 | NaN |
| 6 | Linear Regression Model | 2864.931041 | 1.208937 | 0.317789 | NaN |
Final Conclusion¶
This project successfully developed and evaluated multiple machine learning models to predict used car prices for Rusty Bargain's mobile application. The analysis focused on three critical metrics: prediction quality (RMSE), prediction speed, and training time.
Key Findings¶
Best Overall Model: LightGBM Set 2
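- Achieved the lowest RMSE (1,627.28 euros), making it the most accurate model on the validation set
- Trained in a reasonable time (approximately 58 seconds), but had the slowest prediction time of all models tested (approximately 4.4 seconds on the validation set)
- Used all 500 boosting rounds; early stopping never triggered, and the best validation RMSE was reached on the final round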
Model Performance Summary:
- Top performers (RMSE < 1,660): LightGBM Set 2, CatBoost, and XGBoost delivered the strongest predictive accuracy; LightGBM Set 1 (1,684.32) and Random Forest (1,686.25) followed closely
- CatBoost had the fastest prediction time among the gradient boosting models (0.69 seconds) while staying close to the best accuracy (RMSE 1,636.98); only the much less accurate Decision Tree (0.22 seconds) and Linear Regression (0.32 seconds) predicted faster
- Random Forest provided competitive accuracy (RMSE 1,686.25) but required by far the longest training time (1,160 seconds, roughly 19 minutes)
- Linear Regression served as a sanity check with an RMSE of 2,864.93, confirming that the gradient boosting methods substantially outperformed the baseline
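The prediction times above are totals for the whole validation set, so dividing by the number of validation rows gives a rough per-row cost that is easier to relate to a single valuation request. The sketch below does that arithmetic with the `Predict_Time` values from the comparison table. It assumes the validation set holds 25% of the 330,602 rows rounded up, and `per_row_microseconds` is a hypothetical helper. Batch-amortized figures like these understate the latency of predicting one row at a time.

```python
import math

N_ROWS = 330_602   # rows in `data`, as reported by data.info()
TEST_SIZE = 0.25   # test_size passed to train_test_split

# Assumption: the validation size is the 25% share rounded up
n_valid = math.ceil(N_ROWS * TEST_SIZE)

# Predict_Time values in seconds, copied from the comparison table
predict_time = {
    "LightGBM Set 2": 4.374746,
    "CatBoost": 0.692125,
    "XGBoost": 1.102127,
    "LightGBM Set 1": 1.416330,
    "Random Forest": 0.796043,
    "Decision Tree Model": 0.216526,
    "Linear Regression Model": 0.317789,
}


def per_row_microseconds(seconds, n_rows):
    """Average prediction time per row, in microseconds."""
    return seconds / n_rows * 1e6


per_row = {m: per_row_microseconds(t, n_valid) for m, t in predict_time.items()}

print(f"validation rows: {n_valid}")
for model, us in sorted(per_row.items(), key=lambda kv: kv[1]):
    print(f"{model:<24} {us:6.1f} microseconds per row")
```

Even the slowest model averages well under a millisecond per row in batch mode, which suggests the differences matter mainly for bulk scoring rather than for a single lookup in the app.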
Trade-offs Analysis¶
For Production Deployment:
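- If prediction speed is critical: CatBoost is recommended, with sub-second prediction time on the validation set and an RMSE only about 0.6% higher than LightGBM Set 2
- If accuracy is paramount: LightGBM Set 2 provides the best predictions and the shortest training time of the three top models, at the cost of the slowest prediction
- For balanced performance: XGBoost is a workable alternative, slightly less accurate than both (RMSE 1,655.30), with prediction time (1.10 seconds) and training time (197 seconds) between those of LightGBM Set 2 and CatBoost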
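The trade-offs listed above amount to picking the lowest-RMSE model that fits within a time budget. The sketch below expresses that rule over the figures from the comparison table. The `pick_model` helper and the time limits in the usage lines are hypothetical, chosen only to illustrate how the choice shifts as the constraints tighten.

```python
# RMSE in euros and times in seconds, copied from the comparison table
results = [
    {"Model": "LightGBM Set 2", "RMSE": 1627.283377, "Train_Time": 57.508679, "Predict_Time": 4.374746},
    {"Model": "CatBoost", "RMSE": 1636.975409, "Train_Time": 243.666996, "Predict_Time": 0.692125},
    {"Model": "XGBoost", "RMSE": 1655.304347, "Train_Time": 196.659249, "Predict_Time": 1.102127},
    {"Model": "LightGBM Set 1", "RMSE": 1684.315965, "Train_Time": 28.574041, "Predict_Time": 1.416330},
    {"Model": "Random Forest", "RMSE": 1686.246985, "Train_Time": 1160.163886, "Predict_Time": 0.796043},
    {"Model": "Decision Tree Model", "RMSE": 1904.374673, "Train_Time": 27.287978, "Predict_Time": 0.216526},
    {"Model": "Linear Regression Model", "RMSE": 2864.931041, "Train_Time": 1.208937, "Predict_Time": 0.317789},
]


def pick_model(rows, max_predict_time=None, max_train_time=None):
    """Return the lowest-RMSE row within the optional time limits, or None."""
    candidates = [
        r for r in rows
        if (max_predict_time is None or r["Predict_Time"] <= max_predict_time)
        and (max_train_time is None or r["Train_Time"] <= max_train_time)
    ]
    return min(candidates, key=lambda r: r["RMSE"], default=None)


# No constraints: accuracy alone decides
print(pick_model(results)["Model"])

# Hypothetical limit of one second for scoring the validation set
print(pick_model(results, max_predict_time=1.0)["Model"])

# Relative RMSE gap between the two leading models
by_name = {r["Model"]: r for r in results}
gap = (by_name["CatBoost"]["RMSE"] - by_name["LightGBM Set 2"]["RMSE"]) / by_name["LightGBM Set 2"]["RMSE"]
print(f"CatBoost RMSE is {gap:.2%} higher than LightGBM Set 2")
```

Writing the rule down this way makes the recommendation easy to revisit if the business sets explicit limits on training or prediction time.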
Technical Approach¶
The project successfully:
- Cleaned and preprocessed 330,000+ records with extensive missing value imputation using hierarchical grouping strategies
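- Applied categorical encoding suited to each model (native categorical handling for CatBoost and LightGBM, one-hot encoding for XGBoost and the scikit-learn models)
- Validated that the gradient boosting methods outperformed Linear Regression and a single Decision Tree by a wide margin, and edged out Random Forest at a fraction of its training time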
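To make the two encoding styles in the list above concrete, the following plain-Python sketch contrasts one-hot vectors with single integer category codes. It is an illustration only, not the scikit-learn, CatBoost, or LightGBM implementation: the helper names and the toy fuel-type values are hypothetical, and the all-zero vector for an unseen category mirrors the behaviour requested with `handle_unknown='ignore'`.

```python
def fit_categories(values):
    """Learn the sorted list of distinct categories from training values."""
    return sorted(set(values))


def one_hot(value, categories):
    """One column per category; an unseen value becomes an all-zero vector."""
    return [1 if value == c else 0 for c in categories]


def category_code(value, categories):
    """A single integer column; -1 marks a value not seen during fitting."""
    return categories.index(value) if value in categories else -1


# Toy training values, not taken from the dataset
train_fuel = ["petrol", "gasoline", "petrol", "lpg"]
cats = fit_categories(train_fuel)

print(cats)                          # ['gasoline', 'lpg', 'petrol']
print(one_hot("petrol", cats))       # [0, 0, 1]
print(one_hot("hydrogen", cats))     # [0, 0, 0]
print(category_code("lpg", cats))    # 1
```

One-hot encoding adds one column per distinct value, so a column with many distinct values widens the feature matrix considerably, whereas native categorical handling keeps a single column per feature. That difference is one plausible contributor to the longer training times of the one-hot pipelines.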
Recommendation¶
For Rusty Bargain's mobile application, I recommend deploying LightGBM Set 2 as the primary model, with CatBoost as a secondary option if real-time prediction speed becomes a bottleneck. Both models achieve a validation RMSE under 1,650 euros, meaning the root-mean-square prediction error is below this amount, which is acceptable accuracy for a used car valuation tool. Because RMSE penalizes large errors more heavily than small ones, the mean absolute error is no larger than this figure.
The gradient boosting approaches demonstrated clear superiority over simpler methods, justifying their computational overhead for this business application where prediction accuracy directly impacts customer trust and satisfaction.
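When reading the RMSE figures in the recommendation, it helps to remember that RMSE squares each error before averaging, so a few large misses can dominate it. The toy comparison below shows RMSE next to the mean absolute error (MAE) for the same predictions. The prices are hypothetical and not taken from the dataset, and MAE is introduced here purely for illustration, since the notebook itself reports only RMSE.

```python
import math


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical prices in euros: four close predictions and one large miss
actual = [5000, 7500, 3200, 12000, 9000]
predicted = [4900, 7700, 3050, 12050, 6000]

print(f"MAE:  {mae(actual, predicted):.1f}")
print(f"RMSE: {rmse(actual, predicted):.1f}")
```

In this toy case the single 3,000-euro miss pushes RMSE to nearly twice the MAE, which is why an RMSE of roughly 1,630 euros does not mean every prediction is off by that much.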
Checklist¶
- Jupyter Notebook is open
- Code is error free
- The cells with the code have been arranged in order of execution
- The data has been downloaded and prepared
- The models have been trained
- The analysis of speed and quality of the models has been performed